Abstract

In this thesis we study adaptive nonparametric regression with noise misspecifi- 
cation and the complexity of approximation of random fields in dependence of the 
dimension. 

First, we consider the problem of pointwise estimation in nonparametric regression 
with heteroscedastic additive Gaussian noise. We use the method of local approxi- 
mation applying the Lepski method for selecting one estimate from the set of linear 
estimates obtained by the different degrees of localization. This approach is combined 
with the "propagation conditions" on the choice of critical values of the procedure, 
as suggested recently by Spokoiny and Vial [66]. The "propagation conditions" are 
relaxed for the model with misspecified covariance structure. Specifically, the model 
with unknown mean and variance is approximated by the one with the parametric 
assumption of local linearity of the mean function and with an incorrectly specified 
covariance matrix. We show that this procedure allows a misspecification of the
covariance matrix with a relative error up to $o(1/\log n)$, where $n$ is the sample size. The
quality of estimation is measured in terms of nonasymptotic "oracle" risk bounds. 

We then turn to the $\varepsilon$-approximation of $d$-parametric random fields of tensor
product-type by means of $n$-term partial sums of the Karhunen-Loève expansion.
The analysis is restricted to the average case setting. The quantity of interest is
the information complexity $n(\varepsilon, d)$ describing the minimal number of terms in the
partial sums which guarantees an error not exceeding a given level $\varepsilon$. The behavior
of $n(\varepsilon, d)$ as $d \to \infty$ is the subject of our study. It was shown by Lifshits and
Tulyakova [H] that this problem inherits the curse of dimensionality (intractability)
phenomenon. We present the exact asymptotic expression for the information
complexity $n(\varepsilon, d)$.






Acknowledgements 



It is a great pleasure to write this page. I would like to thank my supervisor 
Vladimir Spokoinyi for introducing me to the challenging world of adaptive methods.
I am deeply grateful to my friends and colleagues: Gilles Blanchard, Rada
Dakovic (Matic), Le-Minh Ho, Anastasia and Vladislav Kolodko, Nicole Kramer, 
Volker Kratschmer, Anna Martins (Levina), John G. M. Schoenmakers and Nataliya 
Togobytska for their sympathy and help. I am greatly indebted to Alexandre B. Tsy- 
bakov for his support, important comments and constructive criticism. Many thanks 
go to Andre Beinrucker and Peter Mathe for the careful translation of the abstract 
into German and for valuable comments. I thank the secretary of our research group 
Cristine Schneider for her help in thousands of administrative problems, and for be- 
ing always so friendly and nice. I wish to thank the Weierstrass Institute for Applied 
Analysis and Stochastics (WIAS) which made the completion of this thesis possible. 
Special thanks are due to the WIAS library and especially to Ulrike Hintze and Ilka 
Kleinod. 

The last chapter of this thesis was partially written while the author was visiting 
the Institut für Mathematische Stochastik, Georg-August-Universität, Göttingen, and
was supported by the grants RFBR 05-01-00911 and RFBR-DFG 04-01-04000. I 
am thankful to the supervisor of this part of the thesis Mikhail A. Lifshits for the 
formulation of the problem, and to Manfred Denker for his support and for providing 
excellent working conditions. 




Contents 



1 Introduction
1.1 Nonparametric versus parametric methods
1.2 Local approximation
1.2.1 Local polynomial estimators: basic properties
1.2.2 Mean squared error of local polynomial estimators
1.2.3 Method of local approximation: general set-up
1.3 Information-based complexity and approximation in increasing dimension

2 Adaptive estimation under noise misspecification in regression
2.1 Model and set-up
2.2 Quasi-maximum local likelihood estimation
2.3 Adaptive procedure
2.3.1 Algorithm
2.3.2 Choice of the critical values
2.4 Theoretical study
2.4.1 Local parametric risk bounds
2.4.2 Upper bound for the critical values
2.4.3 Quality of estimation in the nearly parametric case: small modeling bias and propagation property
2.4.4 Quality of estimation in the nonparametric case: the oracle result
2.4.5 Oracle risk bounds for estimators of the regression function and its derivatives
2.5 Rates of convergence
2.5.1 Minimax rate of spatially adaptive local polynomial estimators
2.5.2 SMB, the bias-variance trade-off and the rate of convergence
2.6 Auxiliary results

3 Dependence on the dimension for complexity of approximation of random fields
3.1 Introduction and set-up
3.2 Main result: the exact intractability rate in increasing dimension
3.3 Proof of the main result
3.3.1 Nonlattice case
3.3.2 Lattice case
3.4 Appendix. Examples of tensor product-type random fields
3.4.1 Wiener-Chentsov random field
3.4.2 Completely tucked Brownian sheet
3.4.3 Centered Gaussian processes
3.4.4 Multivariate Anderson-Darling processes

Bibliography



Chapter 1 



Introduction 

1.1 Nonparametric versus parametric methods 

In nonparametric estimation the balance between the approximation error (bias) and 
the variance of the estimator, the so-called bias-variance trade-off, plays a key role. 
The bias part depends on the regularity properties of the unobserved signal. Of- 
ten, for example in image denoising, see [32] and the references therein, this signal 
has spatially inhomogeneous smoothness. This prompts the idea to adapt statistical 
methods to the spatially varying smoothness of the function to be recovered from the 
noisy data. 

On the other hand, there exists the powerful classical theory of parametric estimation,
see [26], where the underlying data distribution $\mathbb{P}$ belongs to a parametric
family $\mathcal{P} = \{P_\theta, \theta \in \Theta\}$ described by a finite-dimensional parameter $\theta \in \Theta \subseteq \mathbb{R}^p$.
Obviously, the assumption that the parametric model holds globally, i.e., that there
exists a parameter $\theta_0 \in \Theta$ such that $\mathbb{P} = P_{\theta_0}$, is too restrictive. It is unrealistic to
believe that real data indeed follow some parametric model or can even be well
approximated by it globally.

One way out of this situation is to increase the number of parameters of the
model, that is, the dimension of the parameter set $\Theta$. This dramatically increases
the complexity of the model and may, especially for high-dimensional data, make the
problem computationally infeasible. See Chapter 3 for an example of such a problem.
One can also approximate a high- or infinite-dimensional parameter set $\Theta$ by a dense
sequence of low-dimensional subsets, "sieves" $\{\Theta_p\}$, $p = 1, 2, \dots$; see [73] for details.
The simplest example of sieves is given by projection estimators, when the signal $f$
is considered as a series expansion with respect to some functional basis. One tries to
approximate $f$ by the finite sums of this expansion, that is, by its projection on the
linear span of the first $N$ basis functions, see [72]. The crucial problem is to decide
how large $N$ should be in order to provide a satisfactory level of approximation error.
Chapter 3 of this thesis addresses the problem of approximation of random fields of
specific "tensor-product" type by the finite sums of the Karhunen-Loève expansion.

Another idea to make a parametric model more flexible is to fix a small number
of parameters, that is, the dimension $p$ of the set $\Theta$, but to reduce the amount of
the data. This leads to the local parametric approach dating back to the book by
Katkovnik [31] and papers [29], [30], where he suggested the method of local approximation.
This approach was further developed with application to image denoising, see
[32], [20] and the references therein. For local polynomial fitting see [17]. An interesting
development in the direction of local-likelihood estimation, closely connected with
the ideas of [2] and [75], is due to Loader [15], Polzehl and Spokoiny [55], Belomestny
and Spokoiny [6]. A fruitful application of this approach is to change-point detection
in time series, see [63], [65], [TT].

In order to compare the method of local approximation with the projection
estimation described above, let us consider the following example. Fix a reference
point $x \in \mathbb{R}$. By the Taylor theorem any function which is $p$ times differentiable
on the closed interval $[x - h, x + h]$ and $p + 1$ times differentiable on the
open interval $(x - h, x + h)$ can be expanded with respect to the polynomial basis:
$f(t) \approx f_p(t) = f(x) + f'(x)(t - x) + \dots + f^{(p)}(x)(t - x)^p/p!$ for any $t \in (x - h, x + h)$.
Here $N = p$ is fixed; we aim to choose the width of the interval $h$ from the data. If
the bandwidth $h$ is sufficiently small, the class of such functions is large and $f_p(t)$
can serve as a reasonable estimator of the value of the unknown signal $f(t)$ for $t$
close to $x$. This idea leads to the method of local approximation, see Section 1.2
for details. Due to the dependence on $x$ this approach is nonparametric, or local
parametric.

The most important problem is the choice of the width $h$ of the interval
providing a satisfactory quality of approximation. If the bandwidth is chosen too
large, it will result in a large approximation error (bias). A small $h$ will reduce the
bias, but because the number of data points falling in this interval will also be small,
the variance of the estimator will be large. In the projection estimation framework
the number $N$ of basis functions plays a similar role: the larger $N$ is, the smaller
the modeling bias and the larger the variance. Thus we come back to the trade-off
between bias and variance, that is, to the problem of the choice of a "good" bandwidth.

If the function $f$ were known or its smoothness were given, then the
bandwidth $h$ would be easy to select. Unfortunately, in most real-life problems no
information about the regularity properties of the underlying signal is available. Thus
we need to construct a data-driven method which adapts itself automatically
to the properties of the function $f$ and, particularly, to its possibly spatially inhomogeneous
smoothness. One way of doing this is, instead of considering a single
bandwidth $h$, to take a finite grid (usually of geometric type) of bandwidths $\{h_k\}_{k=1}^K$
producing a growing sequence of nested neighborhoods of the reference point $x$. This
pointwise-adaptive bandwidth (scale, localizing scheme) selection is based on the idea
known as Lepski's method. This approach was proposed in a series of papers [38], [39],
[40]. The idea is as follows: suppose that a point $x$ and some method of localization
(a smoothing kernel) are fixed. One calculates a sequence of estimators corresponding 
to different scales, and the procedure searches for the largest local vicinity of the cen- 
ter of approximation x , that is for the largest bandwidth, for which the corresponding 
estimator is not rejected by the data. The calculated estimators are compared by the 
algorithm, and the adaptively selected bandwidth is the largest one such that the 
corresponding estimator does not differ significantly from the estimators with smaller 
bandwidths. Among other applications, this idea was further applied by Katkovnik 
as the intersection of the confidence intervals (ICI) rule (see [22]), by Spokoiny as the 
fitted log-likelihood (FLL) technique (see [33]) and as a two sample likelihood ratio 
test with application to change-point detection (see [63], [65]). The interesting recent 
paper by Reiß, Rozenholc, and Cuenod [58] presents a Lepski-type method based on
the Wald-test statistics for robust and quantile regression estimation. 
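
The selection rule described above can be illustrated by a minimal sketch. It rests on simplifying assumptions that are not taken from the text: a rectangular kernel, a local constant fit, a known noise level `sigma`, a Gaussian log-likelihood-type comparison statistic, and hand-picked critical values (their calibration is a separate problem). The function names `local_means` and `lepski_select` are hypothetical.

```python
def local_means(xs, ys, x0, bandwidths):
    """Local constant fits at x0 with a rectangular kernel.

    Returns one (estimate, local sample size) pair per bandwidth.
    Assumes every window contains at least one design point.
    """
    fits = []
    for h in bandwidths:
        window = [y for x, y in zip(xs, ys) if abs(x - x0) <= h]
        fits.append((sum(window) / len(window), len(window)))
    return fits


def lepski_select(fits, sigma, critical_values):
    """Index (0-based) of the largest accepted scale.

    Scale k is accepted if all smaller scales were accepted and the statistic
    T_lk = N_l * (theta_l - theta_k)**2 / (2 * sigma**2) does not exceed the
    critical value attached to scale l, for every l < k.
    """
    selected = 0
    for k in range(1, len(fits)):
        theta_k, _ = fits[k]
        for l in range(k):
            theta_l, n_l = fits[l]
            t_lk = n_l * (theta_l - theta_k) ** 2 / (2.0 * sigma ** 2)
            if t_lk > critical_values[l]:
                return selected  # scale k is rejected; keep the previous one
        selected = k
    return selected


# Noise-free toy data: a step function with a jump at 0.5, reference point 0.3.
xs = [i / 200 for i in range(1, 201)]
ys = [1.0 if x > 0.5 else 0.0 for x in xs]
bandwidths = [0.05, 0.1, 0.2, 0.4]
fits = local_means(xs, ys, 0.3, bandwidths)
k_hat = lepski_select(fits, sigma=0.1, critical_values=[3.0, 3.0, 3.0])
print(bandwidths[k_hat])
```

On this toy data the three smallest windows lie to the left of the jump, while the largest one crosses it, so the rule stops before the largest bandwidth.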

It is well known from approximation theory that the smoothness of a function
can be expressed via the quality of its approximation by a sufficiently regular kernel
smoother (see [68]). The Taylor theorem can be considered from this point of view
as well. Let the degree $p$ of the Taylor polynomial be fixed. Then the quality of
approximation of a function $f$ by the finite sum of the Taylor expansion and the width
$h$ of the proper vicinity of approximation also express the smoothness of $f$. Thus
the procedure described above intrinsically adapts to the local smoothness
properties of the unknown function $f$. One can also select simultaneously a kernel
and a bandwidth, see the second part of [12].

Since the seminal paper by Lepski [38] dating back to 1990, the local pointwise
adaptive methods based on Lepski's approach have shown their power in applications
to image denoising [SJ], [20], [22], robust and quantile regression [58], change-point
detection and volatility estimation in time series [63], [H], [65], [11], density estimation
[TD] and inverse problems, see [U] and the references therein. This list is
not complete and just shows the possible spectrum of applications. A new technique
originating from [38] for spatially adaptive local constant approximation employing
local-likelihood methods was suggested in [33]. This approach is based on the assumption
that a regression function can be well approximated by a constant in a
vicinity of a given point. The suggested test statistics $T_{lk}$, $1 \le l < k \le K$, are based
on the fitted local log-likelihood (FLL), that is, on the difference between the value of the
local log-likelihood corresponding to the smaller scale at the point of its maximum
and the maximum of the local log-likelihood corresponding to the larger scale. These
statistics are used for data-driven detection of the size and shape of the homogeneity
area. Lepski's selection rule from [38], see also [^J], is applied to the FLL statistics;
it chooses the adaptive scale (bandwidth $h_{\hat{k}}$) as the largest one for which the values
of $T_{lm}$ are sufficiently small:

$$\hat{k} = \max\{k \le K : T_{lm} \le \mathfrak{z}_l,\ l < m \le k\}. \tag{1.1}$$

The crucial problem for such adaptive methods is the choice of the critical values $\mathfrak{z}_1, \dots, \mathfrak{z}_{K-1}$.
A "propagation approach" for choosing the parameters in the selection rule (3.5) is
advocated in [33] and [66]. The idea is to select the critical values to provide the
prescribed behavior of the procedure in the simplest parametric situation. Then the
procedure should work well even when the parametric assumption is violated.

In [33] and [66] the local constant fit is considered. In Chapter 2 we generalize
the FLL method to the local linear approximation in regression with heteroscedastic
Gaussian noise, and the "propagation approach" is justified for the case of misspecified
covariance structure.

1.2 Local approximation 

1.2.1 Local polynomial estimators: basic properties 

Let us consider, as a motivation for local polynomial fitting, the case of a deterministic
design in $\mathbb{R}$. By the Taylor theorem any function $f(\cdot)$ in a Hölder class $\Sigma(\beta, L)$,
$\beta > 1$, can be represented, up to a remainder term, as
$f(t) \approx f(x) + f'(x)(t - x) + \dots + f^{(p-1)}(x)(t - x)^{p-1}/(p - 1)!$
for $t$ sufficiently close to $x$ and $p - 1 = \lfloor\beta\rfloor$. This
suggests the use of a local polynomial approximation to $f(t)$ in the form
$f_\theta(t) = \Psi(t - x)^\top \theta$ with $\Psi(u) = (1, u, \dots, u^{p-1}/(p - 1)!)^\top$ and the vector of parameters
$\theta = \theta(x) = (\theta^{(0)}(x), \theta^{(1)}(x), \dots, \theta^{(p-1)}(x))^\top$ with $\theta^{(j)}(x) = f^{(j)}(x)$ to be estimated. The main
intrinsic issue is to detect an optimal "vicinity" of the point $x$ in order to avoid over-
or undersmoothing.

Consider a regression model

$$Y_i = f(X_i) + \sigma\varepsilon_i, \quad i = 1, \dots, n,$$

where $\varepsilon_i$ are independent zero mean random variables with $\mathbb{E}\varepsilon_i^2 = 1$. Given a point
$x \in \mathbb{R}$, we aim to recover the value $f(x)$ from the noisy data. Let $Y = (Y_1, Y_2, \dots, Y_n)^\top$
be the $n$-dimensional vector of observations. Denote for any
$i = 1, \dots, n$ by $\Psi_i$ the vector of values of the polynomial basis functions at the design
points centered at the reference point $x$:

$$\Psi_i = \Psi(X_i - x) = (1, X_i - x, \dots, (X_i - x)^{p-1}/(p - 1)!)^\top$$

and by $\Psi$ the $p \times n$ matrix with columns $\Psi_i$. Let $W(u)$ be a nonnegative localizing
function (smoothing kernel) having its maximum at zero and being finite or vanishing
at infinity: $W(u) \to 0$ as $|u| \to \infty$. To shorten the notation denote also
$w_{h,i}(x) = W\left(\frac{X_i - x}{h}\right)$. The localizing scheme corresponding to a bandwidth $h > 0$ can then be
represented as a diagonal matrix of the form:

$$W_h(x) = \mathrm{diag}\{w_{h,1}(x), \dots, w_{h,n}(x)\}.$$

The following definition of local polynomial estimators is based on the ones from [72],
page 35, and [31], pages 28-29.

Definition 1.2.1. A vector $\hat\theta_h(x) \in \mathbb{R}^p$ defined as a minimizer of the weighted sum
of squares

$$\hat\theta_h(x) = \operatorname*{argmin}_{\theta \in \mathbb{R}^p} \|W_h(x)^{1/2}(Y - \Psi^\top\theta)\|^2 = \operatorname*{argmin}_{\theta \in \mathbb{R}^p} \sum_{i=1}^n (Y_i - \Psi_i^\top\theta)^2 w_{h,i}(x) \tag{2.1}$$

is called a local polynomial estimator of order $p - 1$ of $\theta(x)$. The statistic

$$\hat f_h(x) = e_1^\top \hat\theta_h(x)$$

is called a local polynomial estimator of order $p - 1$ of $f(x)$. Here $e_1 \in \mathbb{R}^p$ is the
first canonical basis vector.

We will refer to the local polynomial estimators of order $p - 1$ of $\theta(x)$ and of
$f(x)$ as the LP$(p - 1)$ estimators of $\theta(x)$ and of $f(x)$, respectively. It is easy to see
that for properly normalized basis functions the LP$(p - 1)$ estimator of $\theta(x)$
provides estimators of all derivatives of the function $f$ of order less than or equal to $p - 1$:

$$\hat f_h^{(j)}(x) = e_{j+1}^\top \hat\theta_h(x), \quad j = 1, \dots, p - 1,$$

with the $j$th canonical basis vector $e_j \in \mathbb{R}^p$.

The LP$(p - 1)$ estimator $\hat\theta_h(x)$ satisfies the normal equations

$$B(x)\hat\theta_h(x) = \Psi W_h(x) Y \tag{2.2}$$

where the symmetric $p \times p$ matrix $B(x)$ is given by

$$B(x) = \Psi W_h(x) \Psi^\top = \sum_{i=1}^n \Psi_i \Psi_i^\top w_{h,i}(x). \tag{2.3}$$

If the matrix $B(x)$ is positive definite ($B(x) \succ 0$), the LP$(p - 1)$ estimator is the
unique solution of (2.2) and is given by the following formula:

$$\hat\theta_h(x) = B(x)^{-1} \Psi W_h(x) Y = B(x)^{-1} \sum_{i=1}^n \Psi_i Y_i w_{h,i}(x). \tag{2.4}$$

In this case the LP$(p - 1)$ estimator $\hat f_h(x)$ is a linear estimator of $f(x)$:

$$\hat f_h(x) = \sum_{i=1}^n Y_i W_i^*(x) \tag{2.5}$$

where the weights $W_i^*(x)$ are given by:

$$W_i^*(x) = e_1^\top B(x)^{-1} \Psi_i w_{h,i}(x). \tag{2.6}$$



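Recall the important reproducing polynomials property of the local polynomial
estimator (see [72], page 36, and, for a more general representation, [31], page 85).

Proposition 1.2.2. Let $x \in \mathbb{R}$ be such that $B(x) \succ 0$ and let $P_{p-1}$ be a polynomial
of degree less than or equal to $p - 1$. Then the weights defined by (2.6) satisfy

$$\sum_{i=1}^n P_{p-1}(X_i) W_i^*(x) = P_{p-1}(x)$$

for any design points $\{X_1, \dots, X_n\}$. In particular,

$$\sum_{i=1}^n W_i^*(x) = 1, \tag{2.7}$$

$$\sum_{i=1}^n (X_i - x)^m W_i^*(x) = 0, \quad m = 1, \dots, p - 1.$$

Proof. By the Taylor expansion

$$P_{p-1}(X_i) = \sum_{m=0}^{p-1} \frac{P_{p-1}^{(m)}(x)}{m!} (X_i - x)^m = \Psi_i^\top \theta(x)$$

with $0! = 1$ and $\theta(x) = (P_{p-1}(x), P'_{p-1}(x), \dots, P_{p-1}^{(p-1)}(x))^\top$. Then by (2.6) and (2.3)

$$\sum_{i=1}^n P_{p-1}(X_i) W_i^*(x) = e_1^\top B(x)^{-1} \sum_{i=1}^n \Psi_i \Psi_i^\top w_{h,i}(x)\, \theta(x) = e_1^\top \theta(x) = P_{p-1}(x).$$

□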
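
The reproducing property can be checked numerically. The sketch below treats the local linear case $p = 2$, where $B(x)$ is a $2 \times 2$ matrix and the weights have the closed form $W_i^*(x) = w_i (S_2 - S_1 (X_i - x)) / (S_0 S_2 - S_1^2)$ with $S_j = \sum_i w_i (X_i - x)^j$. The kernel $W(u) = \max(0, 1 - u^2)$, the equispaced design and the function name `local_linear_weights` are assumptions made for illustration only.

```python
def local_linear_weights(design, x0, h):
    """Weights W_i^*(x0) of the local linear (p = 2) estimator.

    Uses the kernel W(u) = max(0, 1 - u**2); B(x0) is inverted explicitly
    as a 2 x 2 matrix with entries S_0, S_1, S_2.
    """
    w = [max(0.0, 1.0 - ((xi - x0) / h) ** 2) for xi in design]
    s0 = sum(w)
    s1 = sum(wi * (xi - x0) for wi, xi in zip(w, design))
    s2 = sum(wi * (xi - x0) ** 2 for wi, xi in zip(w, design))
    det = s0 * s2 - s1 * s1
    if det <= 0.0:
        raise ValueError("B(x0) is not positive definite; enlarge the bandwidth")
    # First row of B(x0)^{-1} is (s2, -s1) / det.
    return [wi * (s2 - s1 * (xi - x0)) / det for wi, xi in zip(w, design)]


design = [i / 50 for i in range(51)]
weights = local_linear_weights(design, x0=0.37, h=0.2)

# The weights sum to one, annihilate (X_i - x), and reproduce linear functions.
print(sum(weights))
print(sum(wi * (xi - 0.37) for wi, xi in zip(weights, design)))
print(sum(wi * (2.0 + 3.0 * xi) for wi, xi in zip(weights, design)))
```

Up to rounding, the three printed values should agree with $1$, $0$ and the value of the linear function $2 + 3t$ at $t = 0.37$.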
1.2.2 Mean squared error of local polynomial estimators

In this section we show a classical method for obtaining upper bounds for the quadratic
risk of the LP$(p - 1)$ estimator under the assumption that the underlying function $f$
belongs to a Hölder class $\Sigma(\beta, L)$ with $p - 1 = \lfloor\beta\rfloor$. This analysis will be done via
the traditional bias-variance trade-off. Later on, in Section 2.5, it will be shown how
this approach can be adjusted for the purpose of pointwise adaptation. In what follows
we assume a deterministic design with $X_i \in [0, 1]$. Fix a point $x \in \mathbb{R}$ and the
method of localization $W_h(x)$. By (2.4) the local polynomial estimator $\hat\theta_h(x)$ can
be easily decomposed into deterministic and stochastic parts:

$$\hat\theta_h(x) = \theta_h^*(x) + \zeta_h(x),$$

where

$$\theta_h^*(x) = B(x)^{-1} \sum_{i=1}^n \Psi_i w_{h,i}(x) f(X_i), \qquad \zeta_h(x) = \sigma B(x)^{-1} \sum_{i=1}^n \Psi_i w_{h,i}(x) \varepsilon_i.$$

Then

$$\hat f_h(x) = e_1^\top \hat\theta_h(x) = e_1^\top \theta_h^*(x) + e_1^\top \zeta_h(x) \tag{2.8}$$

with

$$e_1^\top \theta_h^*(x) = \sum_{i=1}^n W_i^*(x) f(X_i), \qquad e_1^\top \zeta_h(x) = \sigma \sum_{i=1}^n W_i^*(x) \varepsilon_i.$$
Denote the variance of the stochastic part $e_1^\top \zeta_h(x)$ of (2.8) by

$$\sigma_h^2(x) = \mathrm{Var}[\hat f_h(x)] = e_1^\top \mathbb{E}[\zeta_h(x) \zeta_h(x)^\top] e_1 = \sigma^2 \sum_{i=1}^n (W_i^*(x))^2. \tag{2.9}$$

Define the bias (the approximation error)

$$b_h(x) = \mathbb{E}_f[\hat f_h(x) - f(x)] = e_1^\top \theta_h^*(x) - f(x).$$

Then the bias-variance decomposition for the mean squared error at $x$ is given by

$$\mathrm{MSE}(x) = \mathbb{E}_f[|\hat f_h(x) - f(x)|^2] = b_h^2(x) + \sigma_h^2(x). \tag{2.10}$$

Using Proposition 1.2.2 the bias can be written as follows:

$$b_h(x) = \sum_{i=1}^n (f(X_i) - f(x)) W_i^*(x). \tag{2.11}$$

Following the line of presentation from [72], we impose the following assumptions on
the localizing schemes and the design.

($\mathcal{L}p1$) There exists a number $\lambda_0 > 0$ such that uniformly in $x$ the smallest eigenvalue
fulfills $\lambda_p(B(x)) \ge nh\lambda_0$ for all sufficiently large $n$.

($\mathcal{L}p2$) There exists a real number $a_0 > 0$ such that for any interval $A \subseteq [0, 1]$ and
all $n \ge 1$

$$\frac{1}{n} \sum_{i=1}^n \mathbb{1}\{X_i \in A\} \le a_0 \max\Big\{\int_A dt, \frac{1}{n}\Big\}.$$

($\mathcal{L}p3$) The localizing functions (kernels) $w_{h,i}$ are compactly supported in $[0, 1]$
with

$$w_{h,i}(x) = 0 \quad \text{if } |X_i - x| > h.$$

This immediately implies a similar property for the local polynomial weights:

$$W_i^*(x) = 0 \quad \text{if } |X_i - x| > h.$$

($\mathcal{L}p4$) There exists a finite number $w_{\max}$ such that

$$\sup_{i,x} |w_{h,i}(x)| \le w_{\max}.$$

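Lemma 1.2.3. Assume ($\mathcal{L}p1$)-($\mathcal{L}p4$). Then for $n$ sufficiently large and all $h \ge \frac{1}{2n}$
and $x \in [0, 1]$ the local polynomial weights $W_i^*(x)$ are such that:

$$\sup_{i,x} |W_i^*(x)| \le \frac{C_1}{nh}, \qquad \sum_{i=1}^n |W_i^*(x)| \le C_2$$

with $C_1 = w_{\max}\sqrt{e}/\lambda_0$ and $C_2 = 2 w_{\max} a_0 \sqrt{e}/\lambda_0$.

Proof. Recall that $B(x)$ is a symmetric non-degenerate $p \times p$ matrix. Then by
the Schur theorem there exist an orthogonal matrix $U$ and a diagonal matrix
$\Lambda = \mathrm{diag}\{\lambda_1^{-2}(B(x)), \dots, \lambda_p^{-2}(B(x))\}$ such that $B(x)^{-2} = U^\top \Lambda U$. Then by
Assumption ($\mathcal{L}p1$) for any $\gamma \in \mathbb{R}^p$

$$\gamma^\top B(x)^{-2} \gamma = \gamma^\top U^\top \Lambda U \gamma \le (nh\lambda_0)^{-2} \|\gamma\|^2,$$

implying

$$\|B(x)^{-1} \gamma\| \le (nh\lambda_0)^{-1} \|\gamma\|.$$

By (2.6), Assumptions ($\mathcal{L}p3$) and ($\mathcal{L}p4$) and using that $h \le 1$, we have

$$\begin{aligned}
|W_i^*(x)| = |e_1^\top B(x)^{-1} \Psi_i w_{h,i}(x)|
&\le \frac{w_{\max}}{\lambda_0 nh} \|\Psi_i\| \, \mathbb{1}\{|X_i - x| \le h\} \\
&\le \frac{w_{\max}}{\lambda_0 nh} \Big[1 + h^2 + \dots + \frac{h^{2(p-1)}}{((p-1)!)^2}\Big]^{1/2} \\
&\le \frac{w_{\max}}{\lambda_0 nh} \Big[1 + 1 + \frac{1}{(2!)^2} + \dots + \frac{1}{((p-1)!)^2}\Big]^{1/2}
\le \frac{w_{\max}\sqrt{e}}{\lambda_0 nh},
\end{aligned}$$

where the upper bound $w_{\max}\sqrt{e}\,(\lambda_0 nh)^{-1}$ does not depend on $i$ and $x$.

The second assertion of the lemma is obtained similarly. Condition ($\mathcal{L}p2$) implies

$$\sum_{i=1}^n |W_i^*(x)| \le \frac{w_{\max}\sqrt{e}}{\lambda_0 nh} \sum_{j=1}^n \mathbb{1}\{|X_j - x| \le h\} \le \frac{w_{\max}\sqrt{e}}{\lambda_0}\, a_0 \max\Big\{2, \frac{1}{nh}\Big\} \le C_2$$

for all $h \ge \frac{1}{2n}$. □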

Theorem 1.2.4. Let $f \in \Sigma(\beta, L)$ on $[0, 1]$ and let $\hat f_h(x)$ be the LP$(p - 1)$ estimator
of $f(x)$ with $p - 1 = \lfloor\beta\rfloor$. Then under the conditions of Lemma 1.2.3, for $n$
sufficiently large and all $h \ge \frac{1}{2n}$ and $x \in [0, 1]$,

$$|b_h(x)| \le C_2 \frac{L h^\beta}{(p-1)!}, \qquad \sigma_h^2(x) \le \frac{\sigma^2 C_1 C_2}{nh}$$

with $C_1$ and $C_2$ as in Lemma 1.2.3.

Moreover, the choice of the positive bandwidth $h = h^*(n)$ given by (2.16), such that

$$h^*(n) = O(n^{-\frac{1}{2\beta+1}}),$$

provides the following upper bound for the quadratic risk:

$$\limsup_{n \to \infty} \sup_{f \in \Sigma(\beta, L)} \mathbb{E}_f[\psi_n^{-2} |\hat f_h(x) - f(x)|^2] \le C, \tag{2.12}$$

where

$$\psi_n = O(n^{-\frac{\beta}{2\beta+1}}) \tag{2.13}$$

is given by (2.17) and the constant $C$ is finite and depends on $\beta$, $L$, $\sigma^2$, $p$, $w_{\max}$
and $a_0$ only.

Corollary 1.2.5. Under the conditions of Theorem 1.2.4 we have the same rate for
the MISE (mean integrated squared error):

$$\limsup_{n \to \infty} \sup_{f \in \Sigma(\beta, L)} \mathbb{E}_f\Big[\psi_n^{-2} \int_0^1 |\hat f_h(x) - f(x)|^2 \, dx\Big] \le C \tag{2.14}$$

with the rate $\psi_n$ given by (2.13) and the finite constant $C$ depending on $\beta$, $L$, $\sigma^2$,
$p$, $w_{\max}$ and $a_0$ only.

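Proof. By (2.11) and the Taylor theorem with $\tau_i$ such that the points $\tau_i X_i$ are
between $X_i$ and $x$, we have

$$b_h(x) = \sum_{i=1}^n (f(X_i) - f(x)) W_i^*(x) = \sum_{j=1}^{p-2} \frac{f^{(j)}(x)}{j!} \sum_{i=1}^n (X_i - x)^j W_i^*(x) + \sum_{i=1}^n \frac{f^{(p-1)}(\tau_i X_i)}{(p-1)!} (X_i - x)^{p-1} W_i^*(x).$$

The first summand is equal to zero by Proposition 1.2.2. By the same argument
the second term can be rewritten as follows:

$$\frac{1}{(p-1)!} \sum_{i=1}^n (f^{(p-1)}(\tau_i X_i) - f^{(p-1)}(x)) (X_i - x)^{p-1} W_i^*(x).$$

Then by Lemma 1.2.3, the Hölder condition on $f^{(p-1)}$ and the fact that $W_i^*(x)$ vanishes for $|X_i - x| > h$,

$$|b_h(x)| \le \frac{L}{(p-1)!} \sum_{i=1}^n |\tau_i X_i - x|^{\beta - (p-1)} |X_i - x|^{p-1} |W_i^*(x)| \le \frac{L h^\beta}{(p-1)!} \sum_{i=1}^n |W_i^*(x)| \le C_2 \frac{L h^\beta}{(p-1)!}.$$

By formula (2.9) and Lemma 1.2.3 the variance is bounded by

$$\sigma_h^2(x) \le \sigma^2 \sup_{i,x} |W_i^*(x)| \sum_{i=1}^n |W_i^*(x)| \le \frac{\sigma^2 C_1 C_2}{nh}.$$

Then by (2.10)

$$\mathrm{MSE}(x) \le \tilde C_2 h^{2\beta} + \frac{\tilde C_1}{nh} \tag{2.15}$$

with $\tilde C_1 = \sigma^2 C_1 C_2$ and $\tilde C_2 = C_2^2 L^2 ((p-1)!)^{-2}$. Then the optimal bandwidth $h^*(n)$
minimizing the upper bound for the MSE at $x$ is given by

$$h^*(n) = \Big(\frac{\tilde C_1}{2\beta \tilde C_2}\Big)^{\frac{1}{2\beta+1}} n^{-\frac{1}{2\beta+1}}. \tag{2.16}$$

This gives us the rate $\psi_n$ w.r.t. the squared loss function over a Hölder class $\Sigma(\beta, L)$:

$$\psi_n = C \Big(\frac{L}{(p-1)!}\Big)^{\frac{1}{2\beta+1}} \Big(\frac{\sigma^2}{n}\Big)^{\frac{\beta}{2\beta+1}} = O(n^{-\frac{\beta}{2\beta+1}}) \tag{2.17}$$

with a constant $C$ depending only on $\beta$, $w_{\max}$, $\lambda_0$ and $a_0$. □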
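
As a numerical sanity check of the bias-variance trade-off, the following sketch evaluates an upper bound of the form $\tilde C_2 h^{2\beta} + \tilde C_1/(nh)$ and the bandwidth obtained by setting its derivative in $h$ to zero. The numerical values of the constants and the function names `mse_bound` and `optimal_bandwidth` are hypothetical; only the functional form of the bound is taken from the derivation above.

```python
def mse_bound(h, n, beta, c1, c2):
    """Upper bound c2 * h**(2*beta) + c1 / (n*h): squared bias plus variance."""
    return c2 * h ** (2 * beta) + c1 / (n * h)


def optimal_bandwidth(n, beta, c1, c2):
    """Minimizer of mse_bound in h, from 2*beta*c2*h**(2*beta-1) = c1/(n*h**2)."""
    return (c1 / (2 * beta * c2 * n)) ** (1.0 / (2 * beta + 1))


beta, c1, c2 = 2.0, 1.0, 1.0  # illustrative values only
for n in (100, 10_000, 1_000_000):
    h_star = optimal_bandwidth(n, beta, c1, c2)
    # After rescaling by n**(2*beta/(2*beta+1)) the bound no longer depends on n.
    rescaled = mse_bound(h_star, n, beta, c1, c2) * n ** (2 * beta / (2 * beta + 1))
    print(n, h_star, rescaled)
```

The rescaled value in the last column is the same for every $n$ up to rounding, which reflects the rate $n^{-2\beta/(2\beta+1)}$ for the squared risk.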
1.2.3 Method of local approximation: general set-up

In this section, following, up to notation, the book [31] and the papers [29], [30],
we explain the basic idea of the method of local approximation in a more general
set-up than in the previous section. Consider for simplicity the regression model

$$Y_i = f(X_i) + \varepsilon_i, \quad i = 1, \dots, n.$$

If we want to recover $f(x)$ at the point $x$, we put the center of localization at $x$.
Suppose that some basis $\{\psi_j(\cdot)\}$ is chosen. Denote by $\Psi(u) = (\psi_1(u), \dots, \psi_p(u))^\top$
the vector of basis functions. We believe that for $t$ close to $x$ the values $f(t)$ can
be well approximated by the finite sum

$$f_\theta(t) = \Psi(t - x)^\top \theta(x) = \sum_{j=1}^p \theta^{(j)}(x) \psi_j(t - x) \tag{2.18}$$

where $\Psi(t - x)$ is the vector of values of the basis functions centered at $x$. Thus, to
estimate $f(x)$, we have to estimate the vector of coefficients $\theta(x) = (\theta^{(1)}(x), \dots, \theta^{(p)}(x))^\top$.

Let $W(u)$ be a nonnegative localizing function (smoothing kernel) having its maximum
at zero and being finite or vanishing at infinity: $W(u) \to 0$ as $\|u\| \to \infty$.
Denote also $w_{h,i}(x) = W\left(\frac{X_i - x}{h}\right)$. Let $F : \mathbb{R} \to \mathbb{R}_{\ge 0}$ be a convex loss function. Then
the solution (or solutions) of the minimization problem

$$\hat\theta_h(x) = \operatorname*{argmin}_{\theta \in \mathbb{R}^p} \sum_{i=1}^n F(Y_i - \Psi_i^\top \theta)\, w_{h,i}(x) \tag{2.19}$$

with $\Psi_i = \Psi(X_i - x)$, $i = 1, \dots, n$, is the estimator of the vector $\theta$ at the point $x$
obtained by the method of local approximation. Notice that $\hat\theta_h(x)$ is an M-estimator,
see [25] or [73]. The estimator

$$\hat f_h(x) = \Psi(0)^\top \hat\theta_h(x) = \sum_{j=1}^p \hat\theta_h^{(j)}(x) \psi_j(0) \tag{2.20}$$

is an estimator of the function $f$ at the point $x$ by the method of local approximation.
In the case of the polynomial basis $(1, u, u^2, \dots)$ we have $\Psi(0) = (1, 0, \dots, 0)^\top$ and
$\hat f_h(x)$ is just the first coordinate of $\hat\theta_h(x)$.

It was stressed in [31] (see page 29) that the optimal choice of the parameter
of locality (bandwidth $h$) is one of the most important issues of nonparametric
estimation. Katkovnik [31], see page 16, pointed out that the practical use of the
estimators obtained by the method of local approximation, as well as of any estimators,
requires constructing them adaptively, that is, tuning the parameters in
accordance with the data at hand. This leads essentially to the traditional problem
of testing hypotheses about the model. The necessity of a data-driven treatment
motivates the application of a Lepski-type procedure to the selection of the scale
(of the bandwidth $h$) and the "propagation conditions" approach to the choice of
the critical values of the adaptive procedure (see Section 2.3), suggested in [53] and
in [66] and developed in the present work.

The asymptotic properties of the estimators given by (2.19) and (2.20) were
studied in detail in [29], [30] and [7T]. In [7T] it was shown that the estimators
constructed by (2.19) w.r.t. a convex loss function and the polynomial basis $(1, u, \dots, u^{p-1})$
exhibit the best rate of convergence among all estimators of functions over Hölder
classes $\Sigma(p - 1, L)$ on some bounded subset of $\mathbb{R}$, as well as among all estimators
of their derivatives. The use of a non-quadratic loss function $F(\cdot)$ is very important
in the theory of robust estimation and allows one to treat noise with unbounded
variance, see for instance the classical paper of Huber [23].
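
To illustrate the role of a non-quadratic loss in the method of local approximation, consider the simplest case of a single constant basis function and the absolute loss $F(y) = |y|$. A minimizer of the weighted sum $\sum_i |Y_i - \theta|\, w_{h,i}(x)$ is then a weighted median of the observations. The sketch below is an illustration under these assumptions, with rectangular weights and hypothetical function names; it is not the procedure studied in this thesis.

```python
def weighted_median(values, weights):
    """A minimizer of sum_i weights[i] * |values[i] - theta| over theta."""
    pairs = sorted((v, w) for v, w in zip(values, weights) if w > 0)
    half = sum(w for _, w in pairs) / 2.0
    cumulative = 0.0
    for v, w in pairs:
        cumulative += w
        if cumulative >= half:
            return v
    raise ValueError("no observation has positive weight")


def local_constant_fit(xs, ys, x0, h, robust=True):
    """Local constant fit at x0 with rectangular weights.

    robust=True uses the absolute loss (weighted median),
    robust=False uses the quadratic loss (weighted mean).
    """
    weights = [1.0 if abs(x - x0) <= h else 0.0 for x in xs]
    if robust:
        return weighted_median(ys, weights)
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)


# Constant signal equal to 1 with a single gross outlier at x = 0.5.
xs = [i / 20 for i in range(21)]
ys = [1.0] * 21
ys[10] = 100.0
print(local_constant_fit(xs, ys, 0.5, 0.2, robust=True))
print(local_constant_fit(xs, ys, 0.5, 0.2, robust=False))
```

The fit based on the absolute loss is not affected by the single outlier, whereas the quadratic loss pulls the estimate far away from the signal.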

If the basis $\{\psi_j\}$ is an orthonormal basis in $L_2(\mathcal{X})$ for some compact $\mathcal{X} \subset \mathbb{R}^d$
and the loss function is quadratic, i.e., if $F(y) = y^2$, then the estimator $\hat\theta_h(x)$
defined by (2.19) is the weighted least squares estimator. If the matrix
$B(x) = \sum_{i=1}^n \Psi_i \Psi_i^\top w_{h,i}(x)$ is positive definite, then one can write:

$$\hat\theta_h(x) = B(x)^{-1} \sum_{i=1}^n \Psi_i Y_i w_{h,i}(x).$$

In this case $\hat\theta_h(x)$ and $\hat f_h(x)$ are linear estimators. Taking the polynomial basis we
come back to the LP$(p - 1)$ estimator introduced in the previous section.

1.3 Information-based complexity and 

approximation in increasing dimension 

Computational complexity is a measure of the intrinsic computational resources
required to solve a mathematically formulated problem. It depends on the problem,
but not on the particular algorithm used. The notion "information" is used in the
theory of complexity in the everyday sense of the word: the information is what
we know about the problem to be solved. It should be stressed that this term, as used
in Chapter 3, has nothing in common with Shannon's definition of information, nor
with the Kullback-Leibler information criterion [36] used in Chapter 2. See [67] for an
informal introduction, which nevertheless contains a comprehensive overview of the literature.

One can distinguish two different types of complexity. In the first case the information
is complete, exact, and free; an example is provided by the traveling salesman
problem. This is the so-called combinatorial complexity. The information-based
complexity that we are interested in here is the computational complexity of (multivariate)
continuous mathematical models. This branch of computational complexity
deals with the intrinsic difficulty of the approximate solution of a problem for which
the information is partial, noisy, and priced, see [51]. This is the case when dealing
with continuous problems on infinite-dimensional spaces. Only partial information,
such as a finite number of functional values, is available. In this case the problem
can only be solved approximately, implying the presence of error. Usually one requires
the problem to be solved with an error not larger than a threshold $\varepsilon$. The
information-based complexity is then defined as the minimal number $n(\varepsilon, d)$ of information
operations (functional values, for example) needed to solve the $d$-variate
problem with an error not exceeding $\varepsilon$. In different settings and for different error
criteria, $\varepsilon$ may have different meanings, but it always reflects the error tolerance.

As pointed out in [IH], a central issue is the study of how the information complexity
depends on $\varepsilon^{-1}$ and $d$. If $n(\varepsilon, d)$ depends exponentially on $\varepsilon^{-1}$ and $d$,
the problem is called intractable. Many multivariate problems exhibit exponential
dependence on $d$, called, after Bellman [5], the curse of dimensionality. If the information
complexity depends on $\varepsilon^{-1}$ and $d$ polynomially, the problem is polynomially
tractable.

In spite of the existence of a vast literature on the computational complexity of
$d$-variate problems, most of the papers and books study error bounds without taking
into account the dependence on $d$. Research on tractability, requiring the knowledge
of the dependence on both $\varepsilon^{-1}$ and $d$, was started in the early nineties by Woźniakowski
[76], [77], [78], who introduced the notion of "tractability" and suggested considering
the dependence on $d$ as $d \to \infty$. This is important for numerous applications
including physics, chemistry, finance, economics, and the computational sciences. For
instance, in quantum mechanics, statistical mechanics and mathematical finance, for
path integration the number of variables is infinite; approximations to path integrals
result in arbitrarily large $d$, see [59] and [19] for details.

In average case settings the cost and the error are defined by their average performance.
The general theory in the average case setting, among other approaches, was
created by Traub, Wasilkowski, and Woźniakowski in [51]. Further development
is presented in the monographs of Ritter [5S] and Novak and Woźniakowski [IH].

One of the problems which can be treated in this framework is the approximation 
(recovery) of functions. Let $T = [0,1]^d$ and $\mathcal{F} = C^{k}(T)$. We identify any $f \in \mathcal{F}$
with its embedding $\mathrm{id}(f) = f$ in the (weighted) $L_p$-space over $T$ with $1 \le p < \infty$.
Let the data be the functional values $f(t_1), \ldots, f(t_n)$. Based on the data $f(t_i)$ an
approximate solution (function) $\hat f$ is constructed. The average error of $\hat f$ is defined
by $(\mathbb{E}\|f - \hat f\|_p^q)^{1/q}$ with some $1 \le q < \infty$, where $\|\cdot\|_p$ denotes a (weighted) $L_p$-norm.

Usually, the computational costs are proportional to the total number of functional
values, and therefore to the information complexity. One aims at finding a "good"
method $\hat f$ with average cost not exceeding a given bound and with minimal average
error. Often one considers methods which use only the functional values $f(t_i)$. Then
the key quantity in the average case settings is the $n$th minimal average error



This minimal error states how well $f$ can be approximated on average by (affine)
linear methods using $n$ functional values. Chapter 3 is devoted to the approximation
of $d$-parametric random fields of tensor product-type, which is a particular case of
linear tensor product problems, see Chapter 6 of [19] for a general study.

Chapter 2

Adaptive estimation under noise 
misspecification in regression 

We consider the problem of pointwise estimation in nonparametric regression with 
heteroscedastic additive Gaussian noise. We use the method of local approximation 
applying the Lepski method for selecting one estimator from a set of linear estimators 
obtained by different degrees of localization. This approach is combined with the 
"propagation conditions" on the choice of critical values of the procedure, as suggested 
recently by Spokoiny and Vial [66]. The "propagation conditions" are relaxed for the
model with misspecified covariance structure. Specifically, the model with unknown 
mean and variance is approximated by the one with the parametric assumption of local 
linearity of the mean function and with an incorrectly specified covariance matrix. 
We show that this procedure allows a misspecification of the covariance matrix with 
a relative error up to , where n is the sample size. The quality of estimation 

is measured in terms of nonasymptotic "oracle" risk bounds.

2.1 Model and set-up

Consider a regression model

\[ Y = f + \Sigma_0^{1/2}\varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I_n), \tag{1.1} \]

with response vector $Y \in \mathbb{R}^n$ and the covariance matrix $\Sigma_0 = \operatorname{diag}(\sigma_{0,1}^2, \ldots, \sigma_{0,n}^2)$.
This model can be written as

\[ Y_i = f(X_i) + \sigma_{0,i}\varepsilon_i, \qquad i = 1,\ldots,n, \]

with design points $X_i \in \mathcal{X} \subset \mathbb{R}^d$. Given a point $x \in \mathcal{X}$, the target of estimation is the
value of the regression function $f(x)$. We apply the method of local approximation
described in Section 1.2.3. In view of the representation (1.1) this means that we
believe that in a vicinity of some given point $x \in \mathcal{X}$ the unknown vector $f$ can be
well approximated by $f_\theta = \Psi^\top\theta$, where $\Psi$ is a given $p \times n$ matrix whose columns $\Psi_i$
consist of the values, at the design points, of basis functions centered at $x$, that is,
$\Psi(u) = (\psi_1(u), \ldots, \psi_p(u))^\top$ for some basis $\{\psi_j\}$ in $L_2(\mathcal{X})$ and $\Psi_i := \Psi(X_i - x)$. The
parameter $\theta = (\theta^{(0)}, \theta^{(1)}, \ldots, \theta^{(p-1)})^\top \in \Theta \subset \mathbb{R}^p$ is the target of estimation, and we
will choose the appropriate width of the localization window adaptively by application
of Lepski's method. The covariance matrix $\Sigma_0$ is not assumed to be known exactly,
and the approximate model used instead of the true one reads as follows:

\[ Y = \Psi^\top\theta + \Sigma^{1/2}\varepsilon, \tag{1.2} \]

where $\Sigma = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_n^2)$, $\min_i\{\sigma_i^2\} > 0$. Thus the model is misspecified in two
places: in the form of the regression function and in the error distribution. Following
the abbreviation from Katkovnik [32] we will refer to this model as the "local polynomial
approximation" or, more generally, the "local parametric approximation" (LPA),
since it is assumed that the "true" model (1.1) can locally be replaced by the "wrong"
parametric one.

The model constraint on the form of the regression function includes the important
class of polynomial regressions. For example, in the univariate case $x \in \mathbb{R}$, by
Taylor's theorem, the approximation of the unknown function $f(t)$ for $t$ close to $x$ can
be written in the following form: $f_\theta(t) = \theta^{(0)} + \theta^{(1)}(t-x) + \cdots + \theta^{(p-1)}(t-x)^{p-1}/(p-1)!$,
with the parameter $\theta = (\theta^{(0)}, \theta^{(1)}, \ldots, \theta^{(p-1)})^\top$ corresponding to the values of $f$ and
its derivatives at the point $x$. The $p \times n$ matrix $\Psi$ then consists of the columns
$\Psi_i = (1, X_i - x, \ldots, (X_i - x)^{p-1}/(p-1)!)^\top$, $i = 1,\ldots,n$. If the regression function
is sufficiently smooth then, for any $t$ close to $x$, up to a remainder term, $f(t) \approx f_\theta(t)$,
and the estimator of $f(x)$ at the point $x$ is given by the first coordinate of $\tilde\theta$, that
is by $\tilde f(x) = f_{\tilde\theta}(x) = \tilde\theta^{(0)}$. For further information on local polynomial regression see
Section \T7I[ or, for deeper insight, [T7|, [22] or [IS].

The general approach advocated here also includes the important case of local
constant approximation at a given point $x \in \mathbb{R}$. In this case the design matrix is
$\Psi = (1, \ldots, 1)$ and $f_\theta(X_i) = \Psi_i^\top\theta = \theta^{(0)} = f_\theta(x)$, $i = 1,\ldots,n$.
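
The construction of the design matrix for the polynomial basis can be illustrated with a short sketch. It is not taken from the thesis: the function names `design_matrix` and `local_fit` are hypothetical, and the code only assumes the univariate columns $\Psi_i = (1, X_i - x, \ldots, (X_i - x)^{p-1}/(p-1)!)^\top$ described above. With `p = 1` it reduces to the local constant case $\Psi = (1, \ldots, 1)$.

```python
from math import factorial


def design_matrix(points, x, p):
    """Return the p x n matrix Psi (list of rows) whose i-th column is
    (1, X_i - x, ..., (X_i - x)^(p-1) / (p-1)!)^T. Hypothetical helper."""
    return [[(xi - x) ** j / factorial(j) for xi in points] for j in range(p)]


def local_fit(theta, t, x):
    """Evaluate f_theta(t) = sum_j theta^(j) (t - x)^j / j! for a given theta."""
    return sum(th * (t - x) ** j / factorial(j) for j, th in enumerate(theta))


X = [0.0, 0.5, 1.0]
Psi = design_matrix(X, x=0.5, p=3)   # rows: constant, linear, quadratic terms
Psi_const = design_matrix(X, x=0.5, p=1)  # local constant case: a single row of ones
```

Evaluating `local_fit(theta, x, x)` returns the first coordinate of `theta`, which is why the estimator of $f(x)$ is read off from the first component of the parameter.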

2.2 Quasi-maximum local likelihood estimation

Fix a point $x \in \mathbb{R}^d$ and an orthogonal basis $\{\psi_j\}$ in $L_2(\mathcal{X})$. Let the localizing
operator be identified with the corresponding matrix. Thus for every $x$ the sequence of
localizing schemes (scales) $\mathcal{W}_k(x)$, $k = 1, \ldots, K$, is given by the matrices $\mathcal{W}_k(x) =
\operatorname{diag}(w_{k,1}(x), \ldots, w_{k,n}(x))$, where the weights $w_{k,i}(x) \in [0,1]$ can be understood,
for instance, as smoothing kernels $w_{k,i}(x) = W((X_i - x)h_k^{-1})$. We assume that a
particular localizing function is fixed, and we aim to choose the index $k$ of the
optimal bandwidth based on the available data. To simplify the notation we
sometimes suppress the dependence on the reference point $x$. Denote by

\[ W_k := \Sigma^{-1/2}\mathcal{W}_k\Sigma^{-1/2} = \operatorname{diag}\Big(\frac{w_{k,1}}{\sigma_1^2}, \ldots, \frac{w_{k,n}}{\sigma_n^2}\Big), \qquad k = 1,\ldots,K. \tag{2.1} \]

Let $\Theta$ be a compact subset of $\mathbb{R}^p$. The LPA means that there exist non-zero weights
$w_{k,i}$ and a parameter $\theta \in \Theta$ such that $f(X_i) \approx f_\theta(X_i) = \Psi_i^\top\theta$ for all $X_i$ with
$w_{k,i} > 0$. The notation $f(X_i) \approx \Psi_i^\top\theta$ also means that the localized data
distributions, obtained by restricting the measures $P_{f,\Sigma_0}$ and $P_{\theta,\Sigma_0}$ to the $\sigma$-field
generated by those data for which $w_{k,i} > 0$, are close to each other in a certain sense,
see the modeling bias in Section 2.4.3.

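Under the LPA the corresponding local quasi-log-likelihood has the following form:

\[ L(W_k, \theta) = -\frac{1}{2}(Y - \Psi^\top\theta)^\top W_k (Y - \Psi^\top\theta) + R
   = -\frac{1}{2}\sum_{i=1}^{n} |Y_i - \Psi_i^\top\theta|^2\,\frac{w_{k,i}}{\sigma_i^2} + R, \tag{2.2} \]

where $R$ stands for the terms not depending on $\theta$, and $W_k$ is defined by (2.1).

Then, due to the assumption of normality of the errors, for every $k$ the quasi-maximum
likelihood estimator (QMLE) $\tilde\theta_k = \tilde\theta_k(x) = (\tilde\theta_k^{(0)}(x), \tilde\theta_k^{(1)}(x), \ldots, \tilde\theta_k^{(p-1)}(x))^\top$
coincides with the LSE and is defined as the minimizer of the weighted sum of squares
from (2.2):

\begin{align*}
\tilde\theta_k &:= \operatorname*{argmax}_{\theta\in\Theta} L(W_k, \theta) \\
 &= \operatorname*{argmin}_{\theta\in\Theta} \|W_k^{1/2}(Y - \Psi^\top\theta)\|^2 \\
 &= B_k^{-1}\Psi W_k Y = B_k^{-1}\sum_{i=1}^{n} \Psi_i Y_i\,\frac{w_{k,i}}{\sigma_i^2}, \tag{2.3}
\end{align*}

where the $p \times p$ matrix $B_k = B_k(x)$ is given by

\[ B_k := \Psi W_k \Psi^\top = \sum_{i=1}^{n} \Psi_i\Psi_i^\top\,\frac{w_{k,i}}{\sigma_i^2}. \tag{2.4} \]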

That is, by Definition 1.2.1, in the case of the polynomial basis the estimator $\tilde\theta_k(x)$
is an $LP_k(p-1)$ estimator of $\theta(x)$ corresponding to the $k$th scale. In the following we
assume that $n > p$ and $\det B_k > 0$ for any $k = 1, \ldots, K$. Because $p = \operatorname{rank}(B_k) \le
\min\{p, \operatorname{rank}(\mathcal{W}_k(x))\}$, this requires the following conditions on the design matrix $\Psi$
and the minimal localizing scheme $\mathcal{W}_1(x)$:

($\mathfrak{D}$) The $p \times n$ design matrix $\Psi$ has full row rank, i.e.,

\[ \dim C(\Psi^\top) = \dim C(\Psi^\top\Psi) = p. \]

($\mathfrak{L}oc$) The smallest localizing scheme $\mathcal{W}_1(x)$ is chosen to contain at least $p$ design
points such that $w_{1,i}(x) > 0$, i.e., $p \le \#\{i : w_{1,i}(x) > 0\}$.

The condition ($\mathfrak{L}oc$) is automatically fulfilled in practice since, for example, in $\mathbb{R}^1$
it means that for local constant fitting we need at least one observation, and so on.
Usually it is implicitly assumed that, starting from the smallest window, at every
step of the procedure every new window contains at least $p$ new design points.
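
For the local constant case $p = 1$ the estimator reduces to a weighted average, which makes the role of $B_k$ easy to see. The following is a minimal sketch under that assumption only; the helper names `boxcar_weights` and `local_constant_qmle` are hypothetical and the rectangular kernel is one possible choice of weights, not a prescription of the text.

```python
def boxcar_weights(points, x, h):
    """Rectangular-kernel weights w_i = 1{|X_i - x| <= h/2} (one possible choice)."""
    return [1.0 if abs(xi - x) <= h / 2 else 0.0 for xi in points]


def local_constant_qmle(y, weights, sigma2):
    """p = 1 instance of the weighted least squares estimator:
    B = sum_i w_i / sigma_i^2 and theta = B^{-1} sum_i Y_i w_i / sigma_i^2.
    sigma2 holds the (possibly misspecified) model variances."""
    b = sum(w / s for w, s in zip(weights, sigma2))
    if b <= 0:
        # corresponds to a violated localization condition: no point in the window
        raise ValueError("window contains no design points (B = 0)")
    theta = sum(yi * w / s for yi, w, s in zip(y, weights, sigma2)) / b
    return theta, b


points = [0, 1, 2, 3, 4]
w = boxcar_weights(points, x=2, h=2)          # -> points 1, 2, 3 are in the window
theta, b = local_constant_qmle([5.0, 1.0, 2.0, 3.0, 9.0], w, [1.0] * 5)
```

The returned `b` is the weighted "local design size"; the estimator is undefined when it vanishes, which is the one-dimensional reading of the requirement $\det B_k > 0$.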

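Formula (2.3) gives a sequence of estimators $\{\tilde\theta_k(x)\}_{k=1}^{K}$. It was noticed in [2]
that in the case of unknown true data distribution the MLE is a natural estimator
for the parameter maximizing the expected log-likelihood. That is, for every $k =
1, \ldots, K$, the estimator $\tilde\theta_k(x)$ can be considered as an estimator of

\begin{align*}
\theta_k^*(x) &:= \operatorname*{argmax}_{\theta\in\Theta} \mathbb{E}\,L(W_k, \theta) \tag{2.5} \\
 &= \operatorname*{argmin}_{\theta\in\Theta} (f - \Psi^\top\theta)^\top W_k (f - \Psi^\top\theta) \\
 &= \operatorname*{argmin}_{\theta\in\Theta} \sum_{i=1}^{n} |f(X_i) - \Psi_i^\top\theta|^2\,\frac{w_{k,i}}{\sigma_i^2}.
\end{align*}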

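Recall that we do not assume that the regression function $f$ satisfies the LPA, even locally.
It is known from [75] that in the presence of model misspecification, for every $k$
the QMLE $\tilde\theta_k$ is a strongly consistent estimator for $\theta_k^*(x)$, which is the minimizer
of the localized Kullback-Leibler [37] information criterion:

\begin{align*}
\theta_k^*(x) &= \operatorname*{argmin}_{\theta\in\Theta} \sum_{i=1}^{n} \mathrm{KL}\big(\mathcal{N}(f(X_i), \sigma_i^2),\, \mathcal{N}(\Psi_i^\top\theta, \sigma_i^2)\big)\, w_{k,i}(x) \\
 &= \operatorname*{argmin}_{\theta\in\Theta} \sum_{i=1}^{n} |f(X_i) - \Psi_i^\top\theta|^2\,\frac{w_{k,i}(x)}{\sigma_i^2}
\end{align*}

with $\mathrm{KL}(P, P_\theta) := \mathbb{E}_P[\log(dP/dP_\theta)]$. For the properties of the Kullback-Leibler
divergence see, for example, [72].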

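It follows from the above definition of $\theta_k^*(x)$ and from (2.3) that the QMLE $\tilde\theta_k$
admits a decomposition into deterministic and stochastic parts:

\begin{align*}
\tilde\theta_k &= B_k^{-1}\Psi W_k (f + \Sigma_0^{1/2}\varepsilon) = \theta_k^* + B_k^{-1}\Psi W_k \Sigma_0^{1/2}\varepsilon, \tag{2.7} \\
\mathbb{E}\,\tilde\theta_k &= \theta_k^*, \tag{2.8}
\end{align*}

where $\varepsilon \sim \mathcal{N}(0, I_n)$. Notice that if the regression function indeed follows the LPA,
that is if $f = \Psi^\top\theta$, then $\theta_k^* = \theta$ for any $k$ and the classical parametric set-up is
recovered.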






2.3 Adaptive procedure 

Let a point $x \in \mathcal{X} \subset \mathbb{R}^d$, an orthogonal basis $\{\psi_j\}$ in $L_2(\mathcal{X})$ and the method of
localization be fixed. The crucial assumption for the procedure under consideration
to work is that the localizing schemes (scales) $\mathcal{W}_k(x) = \operatorname{diag}(w_{k,1}, \ldots, w_{k,n})$
are nested. Specifically, we say that the localizing schemes are nested if the following
ordering condition is fulfilled:

($\mathfrak{W}$) For any fixed $x$ and method of localization the following relation
holds:

\[ \mathcal{W}_1(x) \le \ldots \le \mathcal{W}_k(x) \le \ldots \le \mathcal{W}_K(x). \]

For kernel smoothing this condition means the following. Let the sequence of bandwidths
$\{h_k\}$ be ordered by increasing magnitude, i.e., $h_1 < \ldots < h_K$, and let
$\mathcal{W}_k(x) = \operatorname{diag}(w_{k,1}, \ldots, w_{k,n})$ be the localizing matrix corresponding to the bandwidth
$h_k$. Here the weights $w_{k,i} = w_{k,i}(x) = W((X_i - x)h_k^{-1}) \in [0,1]$ are nonnegative
functions such that for any $0 < h_l < h_k < 1$ it holds $W(u h_l^{-1}) \le W(u h_k^{-1})$ and
$W(u) \to 0$ as $|u| \to \infty$, or even are compactly supported.

Recall that given a center of localization $x \in \mathcal{X}$, a basis $\{\psi_j\}$ and the method of
localization, we look for the estimator of $f(x)$ having the form

\[ \tilde f_k(x) = \sum_{j=1}^{p} \tilde\theta_k^{(j-1)}(x)\,\psi_j(0). \]

The parameters $\tilde\theta_k^{(j-1)}(x)$, $j = 1, \ldots, p$, are the components of the QMLE given by (2.3).
The use of the adaptively chosen $\hat k$ gives the adaptive estimator $\tilde f_{\hat k}(x)$ of $f(x)$
corresponding to the adaptive window choice $w_{\hat k,\cdot}(x)$. In the case of the polynomial
basis $\psi_1(0) = 1$ and $\psi_j(0) = 0$ for $j = 2, \ldots, p$. Then the estimator of $f(x)$ is
just the first coordinate $\tilde\theta_{\hat k}^{(0)}(x)$. In this case we can also get the estimators for the
derivatives of $f$ at the point $x$.

The index $\hat k \in \{1, \ldots, K\}$ corresponds to the adaptive choice of the degree of
localization (of the width of the window), and it will be obtained by application of
Lepski's method, see below. Then the adaptive estimator of the parameter vector is

\[ \hat\theta(x) := \tilde\theta_{\hat k}(x) = \big(\tilde\theta_{\hat k}^{(0)}(x), \ldots, \tilde\theta_{\hat k}^{(p-1)}(x)\big)^\top. \tag{3.1} \]

Informally, the idea of the adaptive procedure used for selection of $\hat k$ can
be described as follows. Let a point $x$ and the method of localization $\mathcal{W}$ be fixed.
For $k = 1, \ldots, K$, let $\tilde\theta_k = \tilde\theta_k(x) = (\tilde\theta_k^{(0)}(x), \tilde\theta_k^{(1)}(x), \ldots, \tilde\theta_k^{(p-1)}(x))^\top$ be the linear
estimator defined by (2.3). We aim to choose an adaptive estimator $\hat\theta(x) = \tilde\theta_{\hat k}(x)$
from the set $\{\tilde\theta_1, \ldots, \tilde\theta_K\}$, that is, to pick the adaptive index $\hat k$ from $\{1, \ldots, K\}$.
Following the Lepski method (see [38]), we proceed with multiple testing of
homogeneity: starting with the smallest scheme $\mathcal{W}_1(x)$ and enlarging it step by step
as long as the estimators $\tilde\theta_k(x)$ do not differ from each other significantly. More
precisely, to describe the test statistic, define for any $\theta, \theta' \in \Theta$ the corresponding
log-likelihood ratio:

\[ L(W_k, \theta, \theta') := L(W_k, \theta) - L(W_k, \theta'). \tag{3.2} \]

Then, using the approach suggested in [33], for every $l = 1, \ldots, K$ the fitted
log-likelihood (FLL) ratio is defined as follows:

\[ L(W_l, \tilde\theta_l, \theta') := \max_{\theta\in\Theta} L(W_l, \theta, \theta'). \]

By Theorem 2.4.1, for any $l$ and $\theta$, the FLL is a quadratic form:

\[ 2L(W_l, \tilde\theta_l, \theta) = (\tilde\theta_l - \theta)^\top B_l (\tilde\theta_l - \theta). \]

Define the confidence set corresponding to $\tilde\theta_l$ as

\begin{align*}
\mathcal{E}_l(\mathfrak{z}_l) &:= \{\theta : 2L(W_l, \tilde\theta_l, \theta) \le \mathfrak{z}_l\} \\
 &= \{\theta : (\tilde\theta_l - \theta)^\top B_l (\tilde\theta_l - \theta) \le \mathfrak{z}_l\}. \tag{3.3}
\end{align*}

In terms of this definition "the estimator $\tilde\theta_k$ does not differ significantly from $\tilde\theta_l$"
means that $\tilde\theta_k \in \mathcal{E}_l(\mathfrak{z}_l)$. This suggests using (see [33]) the FLL statistics

\[ T_{lk} := (\tilde\theta_l - \tilde\theta_k)^\top B_l (\tilde\theta_l - \tilde\theta_k), \qquad l < k. \tag{3.4} \]

If $T_{lk}$ is significantly large, say $T_{lk} > \mathfrak{z}_l$ for some sufficiently big value $\mathfrak{z}_l$, then the
discrepancy between $\tilde\theta_l$ and $\tilde\theta_k$ is not negligible and the corresponding hypothesis
of homogeneity should be rejected in favor of the smaller one. Notice that this simple
approach works only due to the condition ($\mathfrak{W}$), because the hypotheses are nested.

A justification of the FLL approach is given by the fact that the fitted log-likelihood
ratio $L(W_k, \tilde\theta_k, \theta_k^*)$ can be used to measure the quality of estimation of $\theta_k^*$
by its empirical counterpart $\tilde\theta_k$ at each level of localization (see [2] and [75]).

2.3.1 Algorithm 

Given the set of linear estimators $\{\tilde\theta_1, \ldots, \tilde\theta_K\}$ and the set of critical values $\{\mathfrak{z}_1, \ldots, \mathfrak{z}_{K-1}\}$
(see the "propagation conditions" in the next subsection for details), one aims to select
in a data-driven way the estimator $\hat\theta = \tilde\theta_{\hat k}$ with $\hat k \in \{1, \ldots, K\}$. The selection
procedure originating from [38] is described as follows:

$\tilde\theta_k$ is accepted iff $\tilde\theta_{k-1}$ was accepted and

\[ T_{lk} \le \mathfrak{z}_l \quad \text{for all } l < k. \]

That is, we use Lepski's selection rule with the FLL test statistics $\{T_{lk}\}$:

\[ \hat k = \max\{k \le K : T_{lm} \le \mathfrak{z}_l,\ l < m \le k\}. \tag{3.5} \]
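
The selection rule can be sketched in a few lines for the local constant case $p = 1$, where $T_{lk} = B_l(\tilde\theta_l - \tilde\theta_k)^2$. This is an illustrative sketch, not the thesis implementation: the function name `lepski_select`, the 0-based indexing and the toy numbers are assumptions made here, and the critical values are simply taken as given.

```python
def lepski_select(theta, b, z):
    """Lepski-type selection for scalar estimates (p = 1).

    theta[k], b[k]: estimate and 'local design size' B_k for scale k = 0..K-1.
    z[l]: critical value attached to scale l = 0..K-2.
    Returns the 0-based index of the largest accepted scale."""
    k_hat = 0  # the smallest scale is always accepted
    for k in range(1, len(theta)):
        # theta_k is accepted iff all earlier scales were accepted and
        # T_{lk} = B_l (theta_l - theta_k)^2 <= z_l for every l < k
        if all(b[l] * (theta[l] - theta[k]) ** 2 <= z[l] for l in range(k)):
            k_hat = k
        else:
            break  # first rejection stops the procedure
    return k_hat


estimates = [1.0, 1.1, 1.05, 3.0]    # toy estimates on growing windows
design_size = [1.0, 2.0, 4.0, 8.0]   # toy values of B_k
k_hat = lepski_select(estimates, design_size, z=[1.0, 1.0, 1.0])
```

In this toy example the last estimate jumps away from the earlier ones, so the procedure stops at the third scale (index 2). The `break` reflects that a scale can only be accepted if its predecessor was.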

2.3.2 Choice of the critical values 

Let $\hat\theta_k$ denote the last accepted estimator after the first $k$ steps of the procedure:

\[ \hat\theta_k := \tilde\theta_{\min\{k, \hat k\}}. \tag{3.6} \]

Denote for some $\kappa \le K$ the hypothesis $H_\kappa : \theta_1^* = \cdots = \theta_\kappa^* = \theta$, which means that
the LPA is fulfilled up to the step $\kappa$. Clearly, by Assumption ($\mathfrak{W}$), for any $k < \kappa$
the hypothesis $H_k$ is included in $H_\kappa$.

Following the idea proposed in jSH] we will choose the critical values $\mathfrak{z}_1, \ldots, \mathfrak{z}_{K-1}$
of the procedure using a kind of "level" conditions under the LPA (homogeneity
hypothesis). In other words, the procedure is optimized to provide the desired error
level in the local parametric situation. As will be shown later (see Theorem 2.4.9),
if the procedure is tuned well under the LPA, it will perform well even when this
assumption is violated.

The Wilks-type Theorem 2.4.2 below gives the bound for the expected fitted
log-likelihood ratio:

\[ \mathbb{E}\,|2L(W_k, \tilde\theta_k, \theta_k^*)|^r \le (1+\delta)^r C(p, r), \tag{3.7} \]

where the constant $C(p, r)$ does not depend on the degree of localization and is given
by:

\[ C(p, r) = \mathbb{E}\,|\chi_p^2|^r = 2^r\,\frac{\Gamma(r + p/2)}{\Gamma(p/2)}. \tag{3.8} \]
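
The constant $C(p, r)$ is the $r$th moment of a chi-squared variable with $p$ degrees of freedom, for which the standard closed form is $2^r\,\Gamma(r + p/2)/\Gamma(p/2)$. A small sketch (the function name `chi2_moment` is hypothetical) evaluates it and can be compared with the familiar values $\mathbb{E}\chi_p^2 = p$ and $\mathbb{E}(\chi_p^2)^2 = p(p+2)$:

```python
from math import gamma


def chi2_moment(p, r):
    """r-th moment of a chi-squared variable with p degrees of freedom:
    E|chi2_p|^r = 2^r * Gamma(r + p/2) / Gamma(p/2)."""
    return 2.0 ** r * gamma(r + p / 2.0) / gamma(p / 2.0)


first = chi2_moment(3, 1)    # mean of chi2_3
second = chi2_moment(3, 2)   # second moment of chi2_3
```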

Take some "confidence level" $\alpha \in (0, 1]$. Then the set of $K - 1$ conditions on
the choice of the critical values $\mathfrak{z}_1, \ldots, \mathfrak{z}_{K-1}$ can be defined to provide at each step of
the procedure a risk of the adaptive estimators of at most an $\alpha$-fraction of the best
possible (parametric) risk (3.7). These conditions are given by the following formulas:

Definition 2.3.1. (Propagation conditions (PC))

The critical values $\mathfrak{z}_1, \ldots, \mathfrak{z}_{K-1}$ satisfy the following set of conditions:

\[ \mathbb{E}_{0,\Sigma}\,|(\tilde\theta_k - \hat\theta_k)^\top B_k (\tilde\theta_k - \hat\theta_k)|^r \le \alpha\,C(p, r) \quad \text{for all } k = 2, \ldots, K, \tag{3.9} \]

where $C(p, r)$ is defined by (3.8), $\alpha \in (0, 1]$ and $\mathbb{E}_{0,\Sigma}$ stands for the expectation
w.r.t. the measure $\mathcal{N}(0, \Sigma)$.

Remark 2.3.1. Lemma 2.6.1 (see Section 2.6) shows that under the LPA the Gaussian
distribution provides a nice pivotality property: the actual value of the parameter
$\theta$ is not important for the risk of the adaptive estimator, so one can put $\theta = 0$ in (3.9).

Remark 2.3.2. Since the procedure is fitted in the parametric situation, ideally
(while the LPA holds) it should not terminate. If it does, then the critical values are
too small. This event will be referred to as a "false alarm". Therefore by the (PC)
we require that at each level of localization the risk associated with the type I error
is at most an $\alpha$-fraction of the corresponding risk in the parametric situation.

2.4 Theoretical study



2.4.1 Local parametric risk bounds 

To justify the statistical properties of the considered procedure we need the following
simple observation. Let for any $\theta, \theta' \in \Theta$ the corresponding log-likelihood ratio
$L(W_k, \theta, \theta')$ be defined by (3.2). Then

\[ 2L(W_k, \theta, \theta') = (Y - \Psi^\top\theta')^\top W_k (Y - \Psi^\top\theta') - (Y - \Psi^\top\theta)^\top W_k (Y - \Psi^\top\theta). \]

Theorem 2.4.1. (Quadratic shape of the fitted log-likelihood)

Let for every $k = 1, \ldots, K$ the fitted log-likelihood (FLL) be defined as follows:

\[ L(W_k, \tilde\theta_k, \theta') := \max_{\theta\in\Theta} L(W_k, \theta, \theta'). \]

Then

\[ 2L(W_k, \tilde\theta_k, \theta) = (\tilde\theta_k - \theta)^\top B_k (\tilde\theta_k - \theta). \tag{4.1} \]

Proof. Notice that $L(W_k, \theta)$ defined by (2.2) is quadratic in $\theta$. The assertion follows
from the second-order Taylor expansion at the point $\tilde\theta_k$, because it is the point
of maximum and the second derivative is the constant matrix $-B_k$. □

In order to control the admissible level of misspecification for the "model" covariance
matrix from (1.2) we need to introduce the following condition on the relative
variability in errors:

($\mathfrak{S}$) There exists $\delta \in [0, 1)$ such that

\[ 1 - \delta \le \sigma_{0,i}^2/\sigma_i^2 \le 1 + \delta \quad \text{for all } i = 1, \ldots, n. \]

Let the matrix $\mathbf{S}$ be defined as follows:

\[ \mathbf{S} := \Sigma_0^{1/2} W_k \Psi^\top B_k^{-1} \Psi W_k \Sigma_0^{1/2}. \tag{4.2} \]

Then for the distribution of $L(W_k, \tilde\theta_k, \theta_k^*)$ one observes the so-called "Wilks
phenomenon" (see [in] ) described by the following theorem:

Theorem 2.4.2. Let the regression model be given by (1.1) and the parameter maximizing
the expected local log-likelihood $\theta_k^* = \theta_k^*(x)$ be defined by (2.5). Then for any
$k = 1, \ldots, K$ the following equality in distribution takes place:

\[ 2L(W_k, \tilde\theta_k, \theta_k^*) \overset{d}{=} \lambda_1(\mathbf{S})\varepsilon_1^2 + \cdots + \lambda_p(\mathbf{S})\varepsilon_p^2, \tag{4.3} \]

where $p = \operatorname{rank}(B_k) = \dim\Theta$, $\lambda_1(\mathbf{S}), \ldots, \lambda_p(\mathbf{S})$ are the non-zero eigenvalues of
the matrix $\mathbf{S}$ and $\varepsilon_i$ are independent standard normal random variables.

Moreover, under Assumption ($\mathfrak{S}$) the maximal eigenvalue fulfills
$\lambda_{\max}(\mathbf{S}) \le 1 + \delta$ and for any $\mathfrak{z} > 0$

\[ P\{2L(W_k, \tilde\theta_k, \theta_k^*) \ge \mathfrak{z}\} \le P\{\eta \ge \mathfrak{z}/(1+\delta)\}, \tag{4.4} \]

where $\eta$ is a random variable distributed according to the $\chi^2$ law with $p$ degrees of
freedom.

Remark 2.4.1. Generally, if the matrix $B_k$ is degenerate, the number of
terms in (4.3) is smaller than $\dim\Theta$.

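Proof. By Theorem 2.4.1 and the decomposition (2.7) it holds that:

\begin{align*}
2L(W_k, \tilde\theta_k, \theta_k^*) &= (\tilde\theta_k - \theta_k^*)^\top B_k (\tilde\theta_k - \theta_k^*) \\
 &= (B_k^{-1}\Psi W_k \Sigma_0^{1/2}\varepsilon)^\top B_k (B_k^{-1}\Psi W_k \Sigma_0^{1/2}\varepsilon) \\
 &= \varepsilon^\top \mathbf{S}\,\varepsilon,
\end{align*}

where the symmetric matrix $\mathbf{S}$ is defined by (4.2). Then by the Schur theorem there
exist an orthogonal matrix $M$ and a diagonal matrix $\Lambda$ composed of the eigenvalues
of $\mathbf{S}$ such that $\mathbf{S} = M^\top \Lambda M$. For $\varepsilon \sim \mathcal{N}(0, I_n)$ and an orthogonal matrix $M$ it
holds that $\tilde\varepsilon := M\varepsilon \sim \mathcal{N}(0, I_n)$. Indeed, $\mathbb{E}M\varepsilon = M\mathbb{E}\varepsilon = 0$ and

\[ \operatorname{Var} M\varepsilon = \mathbb{E}\,M\varepsilon(M\varepsilon)^\top = M\,\mathbb{E}(\varepsilon\varepsilon^\top)\,M^\top = MM^\top = I_n. \]

Therefore

\[ 2L(W_k, \tilde\theta_k, \theta_k^*) = \tilde\varepsilon^\top \Lambda\,\tilde\varepsilon, \qquad \tilde\varepsilon \sim \mathcal{N}(0, I_n). \]

On the other hand, the matrix $\mathbf{S} = \Sigma_0^{1/2} W_k \Psi^\top B_k^{-1} \Psi W_k \Sigma_0^{1/2}$ can be rewritten as:

\[ \mathbf{S} = \Sigma_0^{1/2} W_k^{1/2}\,\Pi_k\,W_k^{1/2} \Sigma_0^{1/2} \]

with $\Pi_k := W_k^{1/2}\Psi^\top B_k^{-1}\Psi W_k^{1/2}$. Notice that $\Pi_k$ is an orthogonal projector onto
the linear subspace of dimension $p = \operatorname{rank}(B_k)$ spanned by the rows of the matrix $\Psi$.
Indeed, $\Pi_k$ is symmetric and idempotent, i.e., $\Pi_k^2 = \Pi_k$.

Moreover, $\operatorname{rank}(\Pi_k) = \operatorname{tr}(\Pi_k) = \operatorname{tr}(W_k^{1/2}\Psi^\top B_k^{-1}\Psi W_k^{1/2}) = \operatorname{tr}(B_k^{-1}\Psi W_k \Psi^\top) =
\operatorname{tr}(B_k^{-1}B_k) = \operatorname{tr}(I_p) = p$. Therefore $\Pi_k$ has only $p$ unit eigenvalues and $n - p$ zero
eigenvalues. Notice also that the $n \times n$ matrix $\mathbf{S}$ has $\operatorname{rank}(\mathbf{S}) = \operatorname{rank}(\Pi_k W_k^{1/2}\Sigma_0^{1/2}) =
\operatorname{rank}(\Pi_k) = p$ as well. Thus $2L(W_k, \tilde\theta_k, \theta_k^*) = \lambda_1(\mathbf{S})\tilde\varepsilon_1^2 + \cdots + \lambda_p(\mathbf{S})\tilde\varepsilon_p^2$, where
$\lambda_1(\mathbf{S}), \ldots, \lambda_p(\mathbf{S})$ are the non-zero eigenvalues of the matrix $\mathbf{S}$.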

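Define the $L_2$-norm of a matrix $A$ via its maximal eigenvalue

\[ \|A\| := \lambda_{\max}^{1/2}(A^\top A). \tag{4.5} \]

Thus, taking into account Assumption ($\mathfrak{S}$), the induced $L_2$-norm of the matrix $\mathbf{S}$
can be estimated as follows:

\begin{align*}
\|\mathbf{S}\| &= \|\Sigma_0^{1/2} W_k^{1/2}\,\Pi_k\,W_k^{1/2}\Sigma_0^{1/2}\| \\
 &\le \|\Sigma_0^{1/2} W_k^{1/2}\|\,\|\Pi_k\|\,\|W_k^{1/2}\Sigma_0^{1/2}\| \\
 &\le \max_i \Big\{ w_{k,i}\,\frac{\sigma_{0,i}^2}{\sigma_i^2} \Big\} \\
 &\le (1+\delta)\max_i\{w_{k,i}\} \le 1 + \delta.
\end{align*}

Therefore the largest eigenvalue of the matrix $\mathbf{S}$ is bounded: $\lambda_{\max}(\mathbf{S}) \le 1 + \delta$.
The last assertion of the theorem follows from the simple observation that

\[ P\{\lambda_1(\mathbf{S})\tilde\varepsilon_1^2 + \cdots + \lambda_p(\mathbf{S})\tilde\varepsilon_p^2 \ge \mathfrak{z}\} \le P\{\lambda_{\max}(\mathbf{S})(\tilde\varepsilon_1^2 + \cdots + \tilde\varepsilon_p^2) \ge \mathfrak{z}\}. \]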



□ 
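
In the local constant case $p = 1$ the matrix $\mathbf{S}$ has rank one, so its single non-zero eigenvalue can be written down explicitly as $\sum_i \sigma_{0,i}^2 (w_{k,i}/\sigma_i^2)^2 \big/ \sum_i w_{k,i}/\sigma_i^2$. The sketch below evaluates this expression to illustrate the bound $\lambda_{\max}(\mathbf{S}) \le 1 + \delta$; the function name `wilks_eigenvalue` and the toy numbers are assumptions for illustration only.

```python
def wilks_eigenvalue(weights, sigma2_model, sigma2_true):
    """Non-zero eigenvalue of S for p = 1 (Psi = (1, ..., 1)):
    S = v v^T / B with v_i = sigma_{0,i} w_i / sigma_i^2 and B = sum_i w_i / sigma_i^2,
    hence lambda = sum_i sigma_{0,i}^2 (w_i / sigma_i^2)^2 / B."""
    b = sum(w / s for w, s in zip(weights, sigma2_model))
    num = sum(s0 * (w / s) ** 2
              for w, s, s0 in zip(weights, sigma2_model, sigma2_true))
    return num / b


# relative errors sigma_0^2 / sigma^2 are 1.1, 0.9, 1.2, 1.0, i.e. delta = 0.2
lam = wilks_eigenvalue([1.0, 1.0, 0.5, 0.0],
                       [1.0, 2.0, 1.0, 1.0],
                       [1.1, 1.8, 1.2, 1.0])
# boxcar weights with correctly specified noise give exactly 1 (chi-squared case)
lam_exact = wilks_eigenvalue([1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [2.0, 2.0, 2.0])
```

Weights below one and a conservative variance specification both push the eigenvalue below one, which is why the chi-squared tail in (4.4) only needs the rescaling by $1 + \delta$.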



Corollary 2.4.3. (Quasi-parametric risk bounds) 

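Let the model be given by (1.1) and $\theta_k^* = \theta_k^*(x)$ be defined by (2.5). Assume ($\mathfrak{S}$).
Then for any $\mu < 1/(1+\delta)$

\begin{align*}
\mathbb{E}\exp\{\mu L(W_k, \tilde\theta_k, \theta_k^*)\} &\le [1 - \mu(1+\delta)]^{-p/2}, \tag{4.6} \\
\mathbb{E}\,|2L(W_k, \tilde\theta_k, \theta_k^*)|^r &\le (1+\delta)^r C(p, r), \tag{4.7}
\end{align*}

where

\[ C(p, r) = \mathbb{E}\,|\chi_p^2|^r = 2^r\,\frac{\Gamma(r + p/2)}{\Gamma(p/2)}. \tag{4.8} \]

Proof. By (4.3) and independence of the $\varepsilon_i$

\begin{align*}
\mathbb{E}\exp\{\mu L(W_k, \tilde\theta_k, \theta_k^*)\} &= \mathbb{E}\exp\Big\{\frac{\mu}{2}\sum_{i=1}^{p}\lambda_i(\mathbf{S})\varepsilon_i^2\Big\} \\
 &= \prod_{i=1}^{p}\mathbb{E}\exp\Big\{\frac{\mu}{2}\lambda_i(\mathbf{S})\varepsilon_i^2\Big\} \\
 &= \prod_{i=1}^{p}[1 - \mu\lambda_i(\mathbf{S})]^{-1/2} \\
 &\le [1 - \mu\lambda_{\max}(\mathbf{S})]^{-p/2},
\end{align*}

which does not exceed $[1 - \mu(1+\delta)]^{-p/2}$ because $\lambda_{\max}(\mathbf{S}) \le 1 + \delta$.
Let $\eta \sim \chi_p^2$. Integrating by parts yields the second inequality:

\begin{align*}
\mathbb{E}\,|2L(W_k, \tilde\theta_k, \theta_k^*)|^r &= r\int_0^\infty P\{2L(W_k, \tilde\theta_k, \theta_k^*) \ge \mathfrak{z}\}\,\mathfrak{z}^{r-1}\,d\mathfrak{z} \\
 &\le r\int_0^\infty P\{\eta \ge \mathfrak{z}/(1+\delta)\}\,\mathfrak{z}^{r-1}\,d\mathfrak{z} \\
 &= (1+\delta)^r\,\mathbb{E}|\eta|^r.
\end{align*}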



□ 



2.4.2 Upper bound for the critical values 

Let us recall the (partial) Löwner ordering of matrices: for any real symmetric matrices
$A$ and $B$ we will write $A \preceq B$ if and only if $\vartheta^\top A\vartheta \le \vartheta^\top B\vartheta$ for all vectors $\vartheta$,
or, equivalently, if and only if the matrix $B - A$ is nonnegative definite.

Assuming ($\mathfrak{S}$), the true covariance matrix fulfills $\Sigma_0 \preceq (1+\delta)\Sigma$ and the variance
of the estimator $\tilde\theta_k$ is bounded above by $B_k^{-1}$ up to the factor $1 + \delta$:

\begin{align*}
\operatorname{Var}\tilde\theta_k &= B_k^{-1}\Psi W_k \Sigma_0 W_k \Psi^\top B_k^{-1} \tag{4.9} \\
 &\preceq (1+\delta)\,B_k^{-1}\Psi\Sigma^{-1/2}\mathcal{W}_k^2\Sigma^{-1/2}\Psi^\top B_k^{-1} \\
 &\preceq (1+\delta)\,B_k^{-1}\Psi\Sigma^{-1/2}\mathcal{W}_k\Sigma^{-1/2}\Psi^\top B_k^{-1} \\
 &= (1+\delta)\,B_k^{-1}\Psi W_k \Psi^\top B_k^{-1} \\
 &= (1+\delta)\,B_k^{-1}. \tag{4.10}
\end{align*}

The last inequality follows from the observation that all the entries of the "weight"
matrix $\mathcal{W}_k$ do not exceed one, implying $\mathcal{W}_k^2 \preceq \mathcal{W}_k$. Equality occurs if the
$\{w_{k,i}\}$ are boxcar (rectangular) kernels and the noise is known, i.e., $\delta = 0$. To
justify the procedure one needs to show that the critical values chosen by the (PC)
are finite. The upper bound for the critical values is obtained under the following
assumption:

($\mathfrak{B}$) Let the matrices $B_k$ satisfy

\[ u_0 I_p \preceq B_{k-1}^{-1/2} B_k B_{k-1}^{-1/2} \preceq u I_p \]

for some constants $u_0$ and $u$ such that $1 < u_0 \le u$, for any $2 \le k \le K$.

Remark 2.4.2. In the "one dimensional case" $p = 1$, that is for local constant
approximation, the "matrix" $B_k = \sum_{i=1}^{n} w_{k,i}\sigma_i^{-2} \ge B_{k-1}$ is just a weighted "local
design size". Assume for simplicity that $\sigma_i^2 = \sigma^2$, the weights are rectangular kernels
$w_{k,i}(x) = \mathbf{1}\{|X_i - x| \le h_k/2\}$, and the design is equidistant. Then for $n$ sufficiently
large

\[ \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{|X_i - x| \le h_k/2\} \approx h_k, \]

and the condition ($\mathfrak{B}$) means that the bandwidths grow geometrically: $h_k = u h_{k-1}$.

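Denote for any $l < k$ the variance of the difference $\tilde\theta_k - \tilde\theta_l$ by $V_{lk}$:

\[ V_{lk} := \operatorname{Var}(\tilde\theta_k - \tilde\theta_l) \succeq 0. \tag{4.11} \]

Then there exists a unique matrix $V_{lk}^{1/2} \succeq 0$ such that $(V_{lk}^{1/2})^2 = V_{lk}$.

Lemma 2.4.4. Assume ($\mathfrak{S}$), ($\mathfrak{W}$) and ($\mathfrak{B}$). If for some $k \le K$ the LPA is fulfilled,
that is if $\theta_1^* = \cdots = \theta_k^* = \theta$, then for any $l < k$ it holds that:

\begin{align*}
P\{2L(W_l, \tilde\theta_l, \tilde\theta_k) \ge \mathfrak{z}\} &\le P\{\eta \ge \mathfrak{z}/\lambda_{\max}(V_{lk}^{1/2} B_l V_{lk}^{1/2})\}, \\
P\{2L(W_k, \tilde\theta_k, \tilde\theta_l) \ge \mathfrak{z}\} &\le P\{\eta \ge \mathfrak{z}/\lambda_{\max}(V_{lk}^{1/2} B_k V_{lk}^{1/2})\} \\
 &\le P\{\eta \ge \mathfrak{z}/t_1\},
\end{align*}

where $t_0 = 2(1+\delta)(1 + u_0^{-(k-l)})$, $t_1 = 2(1+\delta)(1 + u^{(k-l)})$ and $\eta$ is a $\chi_p^2$-distributed
random variable.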

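Proof. The LPA and (2.7) imply

\[ \tilde\theta_l - \tilde\theta_k \overset{d}{=} V_{lk}^{1/2}\xi, \]

where $\xi$ is a standard normal vector in $\mathbb{R}^p$. Thus by Theorem 2.4.1 under the LPA
for any $l < k$

\[ 2L(W_l, \tilde\theta_l, \tilde\theta_k) = (\tilde\theta_l - \tilde\theta_k)^\top B_l (\tilde\theta_l - \tilde\theta_k) \overset{d}{=} \xi^\top V_{lk}^{1/2} B_l V_{lk}^{1/2}\xi. \]

By the Schur theorem there exists an orthogonal matrix $M$ such that

\[ \xi^\top V_{lk}^{1/2} B_l V_{lk}^{1/2}\xi = \tilde\varepsilon^\top \Lambda\,\tilde\varepsilon, \qquad \tilde\varepsilon := M\xi, \]

where $\tilde\varepsilon$ is a standard normal vector, $\Lambda = \operatorname{diag}(\lambda_1(V_{lk}^{1/2} B_l V_{lk}^{1/2}), \ldots, \lambda_p(V_{lk}^{1/2} B_l V_{lk}^{1/2}))$
and $p = \operatorname{rank}(B_l)$. Therefore

\[ 2L(W_l, \tilde\theta_l, \tilde\theta_k) \overset{d}{=} \sum_{j=1}^{p} \lambda_j(V_{lk}^{1/2} B_l V_{lk}^{1/2})\,\tilde\varepsilon_j^2, \]

where $\lambda_j(V_{lk}^{1/2} B_l V_{lk}^{1/2})$, $j = 1, \ldots, p$, are the nonzero eigenvalues of $V_{lk}^{1/2} B_l V_{lk}^{1/2}$.
By a similar argument:

\[ 2L(W_k, \tilde\theta_k, \tilde\theta_l) \overset{d}{=} \sum_{j=1}^{p} \lambda_j(V_{lk}^{1/2} B_k V_{lk}^{1/2})\,\tilde\varepsilon_j^2. \]

Recalling that $\eta$ is a $\chi_p^2$-distributed random variable, we have

\begin{align*}
P\{2L(W_l, \tilde\theta_l, \tilde\theta_k) \ge \mathfrak{z}\} &\le P\{\eta \ge \mathfrak{z}/\lambda_{\max}(V_{lk}^{1/2} B_l V_{lk}^{1/2})\}, \\
P\{2L(W_k, \tilde\theta_k, \tilde\theta_l) \ge \mathfrak{z}\} &\le P\{\eta \ge \mathfrak{z}/\lambda_{\max}(V_{lk}^{1/2} B_k V_{lk}^{1/2})\}.
\end{align*}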

Notice that for any square matrices $A$ and $B$,

\[ (A - B)(A^\top - B^\top) \preceq 2(AA^\top + BB^\top). \]
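
The inequality can be verified in one line. The following short derivation is added here as a reader's check; it uses only the fact that any matrix of the form $CC^\top$ is nonnegative definite:

```latex
\[
2(AA^\top + BB^\top) - (A - B)(A - B)^\top
  = AA^\top + BB^\top + AB^\top + BA^\top
  = (A + B)(A + B)^\top \succeq 0 .
\]
```

Equality holds exactly when $A = -B$, so the constant 2 cannot be improved in general.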

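Application of this bound to the variance of the difference of estimators yields

\begin{align*}
V_{lk} &= (B_l^{-1}\Psi W_l \Sigma_0^{1/2} - B_k^{-1}\Psi W_k \Sigma_0^{1/2})(B_l^{-1}\Psi W_l \Sigma_0^{1/2} - B_k^{-1}\Psi W_k \Sigma_0^{1/2})^\top \\
 &\preceq 2(B_l^{-1}\Psi W_l \Sigma_0 W_l \Psi^\top B_l^{-1} + B_k^{-1}\Psi W_k \Sigma_0 W_k \Psi^\top B_k^{-1}) \\
 &= 2(V_l + V_k),
\end{align*}

where $V_l = \operatorname{Var}\tilde\theta_l$, $l \le k$. By the upper bound (4.10) for the variance $V_l$ (resp. $V_k$)
and by Assumption ($\mathfrak{B}$):

\begin{align*}
V_l &\preceq (1+\delta)B_l^{-1}, \\
V_k &\preceq (1+\delta)B_k^{-1} \preceq (1+\delta)\,u_0^{-(k-l)}B_l^{-1}.
\end{align*}

Therefore

\[ B_l \preceq 2(1+\delta)(1 + u_0^{-(k-l)})\,V_{lk}^{-1}. \tag{4.12} \]

Thus by (4.12) the upper bound for the induced $L_2$ matrix norm reads as follows:

\begin{align*}
\lambda_{\max}(V_{lk}^{1/2} B_l V_{lk}^{1/2}) &= \|V_{lk}^{1/2} B_l V_{lk}^{1/2}\| \\
 &= \sup_{\|\gamma\|=1} \gamma^\top V_{lk}^{1/2} B_l V_{lk}^{1/2}\gamma \\
 &\le 2(1+\delta)(1 + u_0^{-(k-l)}) \sup_{\|\gamma\|=1} \gamma^\top V_{lk}^{1/2} V_{lk}^{-1} V_{lk}^{1/2}\gamma \\
 &= 2(1+\delta)(1 + u_0^{-(k-l)}). \tag{4.13}
\end{align*}

Similarly:

\begin{align*}
V_{lk} &\preceq 2(1+\delta)(1 + u^{(k-l)})\,B_k^{-1}, \\
\lambda_{\max}(V_{lk}^{1/2} B_k V_{lk}^{1/2}) &\le 2(1+\delta)(1 + u^{(k-l)}). \tag{4.14}
\end{align*}

These bounds imply

\begin{align*}
P\{2L(W_k, \tilde\theta_k, \tilde\theta_l) \ge \mathfrak{z}\} &\le P\{\eta \ge \mathfrak{z}/\lambda_{\max}(V_{lk}^{1/2} B_k V_{lk}^{1/2})\} \\
 &\le P\{\eta \ge \mathfrak{z}\,[2(1+\delta)(1 + u^{(k-l)})]^{-1}\}.
\end{align*}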



□ 



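Lemma 2.4.5. Under the conditions of the preceding lemma, for any $\mu_0 < t_0^{-1}$, or
$\mu_1 < t_1^{-1}$ respectively, the exponential moments are bounded:

\begin{align*}
\mathbb{E}\exp\{\mu_0 L(W_l, \tilde\theta_l, \tilde\theta_k)\} &\le [1 - \mu_0 t_0]^{-p/2}, \\
\mathbb{E}\exp\{\mu_1 L(W_k, \tilde\theta_k, \tilde\theta_l)\} &\le [1 - \mu_1 t_1]^{-p/2},
\end{align*}

where $t_0 = 2(1+\delta)(1 + u_0^{-(k-l)})$ and $t_1 = 2(1+\delta)(1 + u^{(k-l)})$.

Proof. The proof of the lemma is similar to the proof of Corollary 2.4.3. The bounds
(4.13) and (4.14) imply the following bounds for the corresponding moment generating
functions:

\begin{align*}
\mathbb{E}\exp\{\mu L(W_l, \tilde\theta_l, \tilde\theta_k)\} &= \prod_{j=1}^{p}\mathbb{E}\exp\Big\{\frac{\mu}{2}\lambda_j(V_{lk}^{1/2} B_l V_{lk}^{1/2})\,\tilde\varepsilon_j^2\Big\} \\
 &\le [1 - \mu\lambda_{\max}(V_{lk}^{1/2} B_l V_{lk}^{1/2})]^{-p/2} \\
 &\le [1 - 2\mu(1+\delta)(1 + u_0^{-(k-l)})]^{-p/2}, \\
\mathbb{E}\exp\{\mu L(W_k, \tilde\theta_k, \tilde\theta_l)\} &\le [1 - \mu\lambda_{\max}(V_{lk}^{1/2} B_k V_{lk}^{1/2})]^{-p/2} \\
 &\le [1 - 2\mu(1+\delta)(1 + u^{(k-l)})]^{-p/2}.
\end{align*}

□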



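Lemma 2.4.6. Under the conditions of the preceding lemma it holds that:

\begin{align*}
\mathbb{E}\,|2L(W_l, \tilde\theta_l, \tilde\theta_k)|^r &\le 2^r C(p, r)(1+\delta)^r(1 + u_0^{-(k-l)})^r, \\
\mathbb{E}\,|2L(W_k, \tilde\theta_k, \tilde\theta_l)|^r &\le 2^r C(p, r)(1+\delta)^r(1 + u^{(k-l)})^r,
\end{align*}

where

\[ C(p, r) = 2^r\,\frac{\Gamma(r + p/2)}{\Gamma(p/2)}. \]

Proof. Integration by parts and Lemma 2.4.4 yield for the second assertion

\begin{align*}
\mathbb{E}\,|2L(W_k, \tilde\theta_k, \tilde\theta_l)|^r &= r\int_0^\infty P\{2L(W_k, \tilde\theta_k, \tilde\theta_l) \ge \mathfrak{z}\}\,\mathfrak{z}^{r-1}\,d\mathfrak{z} \\
 &\le r\int_0^\infty P\{\eta \ge \mathfrak{z}\,[2(1+\delta)(1 + u^{(k-l)})]^{-1}\}\,\mathfrak{z}^{r-1}\,d\mathfrak{z} \\
 &= 2^r(1+\delta)^r(1 + u^{(k-l)})^r\,\mathbb{E}|\eta|^r,
\end{align*}

where $\eta \sim \chi_p^2$. The first assertion is proved similarly. □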

Theorem 2.4.7. (The theoretical choice of the critical values) 

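Assume ($\mathfrak{D}$), ($\mathfrak{L}oc$), ($\mathfrak{B}$) and ($\mathfrak{W}$). The adaptive procedure (3.5) in the considered
set-up is well-defined in the sense that the choice of the critical values

\[ \mathfrak{z}_k = \frac{4}{\mu}\Big\{ r(K - k)\log u + \log(K/\alpha) - \frac{p}{4}\log(1 - 4\mu) - \log(1 - u^{-r}) + \bar C(p, r) \Big\} \tag{4.15} \]

provides the conditions (3.9) for all $k \le K$. Here

\[ \bar C(p, r) = \log\frac{2^{2r}[\Gamma(2r + p/2)\Gamma(p/2)]^{1/2}}{\Gamma(r + p/2)} \]

and $\mu \in (0, 1/4)$. In particular,

\[ \mathbb{E}_{0,\Sigma}\,|(\tilde\theta_K - \hat\theta)^\top B_K (\tilde\theta_K - \hat\theta)|^r \le \alpha\,C(p, r). \tag{4.16} \]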
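
The theoretical critical values can be tabulated numerically. The sketch below is illustrative only and rests on an assumption: it reads (4.15) as $\mathfrak{z}_k = \frac{4}{\mu}\{r(K-k)\log u + \log(K/\alpha) - \frac{p}{4}\log(1-4\mu) - \log(1-u^{-r}) + \bar C(p,r)\}$, where the prefactor $4/\mu$ and the coefficient $p/4$ are inferred from the proof rather than read unambiguously from the source. The function names `c_bar` and `critical_values` are hypothetical.

```python
from math import log, lgamma


def c_bar(p, r):
    """log( 2^{2r} [Gamma(2r + p/2) Gamma(p/2)]^{1/2} / Gamma(r + p/2) ),
    computed on the log scale for numerical stability."""
    return (2 * r * log(2)
            + 0.5 * (lgamma(2 * r + p / 2) + lgamma(p / 2))
            - lgamma(r + p / 2))


def critical_values(K, p, r, u, alpha, mu):
    """Return [z_1, ..., z_{K-1}] under the assumed reading of (4.15)."""
    if not (0 < mu < 0.25 and u > 1 and 0 < alpha <= 1):
        raise ValueError("need 0 < mu < 1/4, u > 1 and 0 < alpha <= 1")
    const = (log(K / alpha) - p / 4 * log(1 - 4 * mu)
             - log(1 - u ** (-r)) + c_bar(p, r))
    return [4 / mu * (r * (K - k) * log(u) + const) for k in range(1, K)]


z = critical_values(K=5, p=2, r=1, u=2.0, alpha=0.5, mu=0.1)
```

Under this reading the values decrease linearly in $k$ with step $(4/\mu)\,r\log u$, so the earliest (smallest) windows receive the largest thresholds. These theoretical values mainly show that finite critical values satisfying the propagation conditions exist.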

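Proof. The risk corresponding to the adaptive estimator can be represented as a sum
of risks of the false alarms at each step of the procedure:

\[ \mathbb{E}_{0,\Sigma}\,|(\tilde\theta_k - \hat\theta_k)^\top B_k (\tilde\theta_k - \hat\theta_k)|^r = \sum_{m=1}^{k-1} \mathbb{E}_{0,\Sigma}\,|(\tilde\theta_k - \tilde\theta_m)^\top B_k (\tilde\theta_k - \tilde\theta_m)|^r\,\mathbf{1}\{\hat\theta_k = \tilde\theta_m\}. \]

By the definition of the last accepted estimator $\hat\theta_k$ the event $\{\hat\theta_k = \tilde\theta_m\}$ with
$m = 1, \ldots, k - 1$ occurs if for some $l = 1, \ldots, m$ the statistic $T_{l,m+1} > \mathfrak{z}_l$. Thus

\[ \{\hat\theta_k = \tilde\theta_m\} \subseteq \bigcup_{l=1}^{m} \{T_{l,m+1} > \mathfrak{z}_l\}. \]

It holds also that for any positive $\mu$

\[ \mathbf{1}\{T_{l,m+1} > \mathfrak{z}_l\} = \mathbf{1}\{2L(W_l, \tilde\theta_l, \tilde\theta_{m+1}) - \mathfrak{z}_l > 0\} \le \exp\Big\{\frac{\mu}{4}\big(2L(W_l, \tilde\theta_l, \tilde\theta_{m+1}) - \mathfrak{z}_l\big)\Big\}. \]

Application of this simple fact and the Cauchy-Schwarz inequality implies for $m =
1, \ldots, k - 1$ the following bound:

\begin{align*}
&\mathbb{E}_{0,\Sigma}\,|(\tilde\theta_k - \tilde\theta_m)^\top B_k (\tilde\theta_k - \tilde\theta_m)|^r\,\mathbf{1}\{\hat\theta_k = \tilde\theta_m\} \\
&\quad = \mathbb{E}_{0,\Sigma}\,|2L(W_k, \tilde\theta_k, \tilde\theta_m)|^r\,\mathbf{1}\{\hat\theta_k = \tilde\theta_m\} \\
&\quad \le \sum_{l=1}^{m} e^{-\mu\mathfrak{z}_l/4}\,\mathbb{E}_{0,\Sigma}\Big[|2L(W_k, \tilde\theta_k, \tilde\theta_m)|^r \exp\Big\{\frac{\mu}{2}L(W_l, \tilde\theta_l, \tilde\theta_{m+1})\Big\}\Big] \\
&\quad \le \sum_{l=1}^{m} e^{-\mu\mathfrak{z}_l/4}\,\Big\{\mathbb{E}_{0,\Sigma}\,|2L(W_k, \tilde\theta_k, \tilde\theta_m)|^{2r}\Big\}^{1/2}\Big\{\mathbb{E}_{0,\Sigma}\exp\{\mu L(W_l, \tilde\theta_l, \tilde\theta_{m+1})\}\Big\}^{1/2}.
\end{align*}

By Lemma 2.4.5 with $\delta = 0$

\[ \mathbb{E}_{0,\Sigma}\exp\{\mu L(W_l, \tilde\theta_l, \tilde\theta_{m+1})\} \le (1 - 4\mu)^{-p/2}. \]

This together with the bound from Lemma 2.4.6 gives

\[ \mathbb{E}_{0,\Sigma}\,|(\tilde\theta_k - \hat\theta_k)^\top B_k (\tilde\theta_k - \hat\theta_k)|^r
 \le 2^r\sqrt{C(p, 2r)}\,(1 - 4\mu)^{-p/4}\sum_{m=1}^{k-1}\sum_{l=1}^{m} e^{-\mu\mathfrak{z}_l/4}\,(1 + u^{(k-m)})^r \]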

m=l 1=1 

k-1 fe-1 



m=l 



k-l 



r{k-l) 



1=1 



because —{k — l) < —{m — I) and 



k-l 



r{k~l) 



k-l 



m=l 

k-l 



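Since $u^{r(k-l)} \le u^{r(K-l)}$ for any $l < k \le K$ the choice

\[ \mathfrak{z}_l = \frac{4}{\mu}\Big\{ r(K - l)\log u + \log(K/\alpha) - \frac{p}{4}\log(1 - 4\mu) - \log(1 - u^{-r}) + \bar C(p, r) \Big\} \]

with

\[ \bar C(p, r) = \log\frac{2^{2r}[\Gamma(2r + p/2)\Gamma(p/2)]^{1/2}}{\Gamma(r + p/2)} \]

provides the required bound

\[ \mathbb{E}_{0,\Sigma}\,|(\tilde\theta_l - \hat\theta_l)^\top B_l (\tilde\theta_l - \hat\theta_l)|^r \le \alpha\,C(p, r) \quad \text{for all } l = 2, \ldots, K. \]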



□ 



2.4.3 Quality of estimation in the nearly parametric case: 
small modeling bias and propagation property 

The critical values $\mathfrak{z}_1, \ldots, \mathfrak{z}_{K-1}$ were selected by the propagation conditions (3.9)
under the hypothesis of homogeneity of the theta's with a possibly misspecified
error distribution, i.e. under the measure $\mathcal{N}(\theta, \Sigma)$. Now $\theta_1^* \approx \cdots \approx \theta_k^* \approx \theta$
up to some $k \le K$ and the covariance matrix is $\Sigma_0$. The aim is to formalize the
meaning of "$\approx$" and to justify the use of the critical values in this situation. For this
purpose we will take into account the discrepancy between the joint distributions
of the linear estimators $\tilde\theta_1, \ldots, \tilde\theta_k$ for $k = 1, \ldots, K$ under the null (homogeneity)
hypothesis, corresponding to the distributions with mean $\theta_1^* = \cdots = \theta_k^* = \theta$ and
"wrong" covariance matrix $\Sigma$, and in the general situation (under the alternative)
with $\theta_1^* \ne \cdots \ne \theta_k^*$ and covariance matrix $\Sigma_0$. Denote the expectations w.r.t. these
measures by $\mathbb{E}_{\theta,\Sigma} := \mathbb{E}_{k,\theta,\Sigma}$ and $\mathbb{E}_{f,\Sigma_0} := \mathbb{E}_{k,f,\Sigma_0}$ respectively. Denote the $p \times k$ matrix
of the first $k$ estimators by

\[ \tilde\Theta_k := (\tilde\theta_1, \ldots, \tilde\theta_k). \]

Its mean under the alternative (the matrix of the parameters minimizing the expected
local log-likelihoods) is given by

\[ \Theta_k^* := \mathbb{E}_{f,\Sigma_0}\tilde\Theta_k = (\theta_1^*, \ldots, \theta_k^*), \]

and the mean under the null (the "true" parameter in the parametric set-up) is:

\[ \Theta_k := \mathbb{E}_{\theta,\Sigma}\tilde\Theta_k = (\theta, \ldots, \theta). \]

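Let $A \otimes B$ stand for the Kronecker product of $A$ and $B$ defined as

\[ A \otimes B := \begin{pmatrix}
a_{11}B & a_{12}B & \cdots & a_{1n}B \\
a_{21}B & a_{22}B & \cdots & a_{2n}B \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1}B & a_{m2}B & \cdots & a_{mn}B
\end{pmatrix}. \]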
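
The block structure of the Kronecker product is perhaps easiest to see on a tiny example. The helper `kron` below is a hypothetical plain-Python sketch written for this illustration (matrices are lists of rows); it is applied to a matrix of ones and a diagonal matrix, the same kind of pairing as in $J_k \otimes \Sigma$.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows:
    block (i, j) of the result equals a[i][j] * b."""
    return [[a_ij * b_kl for a_ij in row_a for b_kl in row_b]
            for row_a in a for row_b in b]


J2 = [[1, 1], [1, 1]]        # 2 x 2 matrix of ones
Sigma = [[1, 0], [0, 2]]     # toy diagonal covariance matrix
blocks = kron(J2, Sigma)     # every 2 x 2 block is a copy of Sigma
```

Because every block of `blocks` equals `Sigma`, the product encodes that all estimators are built from one and the same noise vector.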



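Denote the $pk \times pk$ covariance matrices of $\operatorname{vec}\tilde\Theta_k = (\tilde\theta_1^\top, \ldots, \tilde\theta_k^\top)^\top \in \mathbb{R}^{pk}$ by

\begin{align*}
\boldsymbol{\Sigma}_k &:= \operatorname{Var}_{\theta,\Sigma}[\operatorname{vec}\tilde\Theta_k] = \mathbf{D}_k(J_k \otimes \Sigma)\mathbf{D}_k^\top, \tag{4.17} \\
\boldsymbol{\Sigma}_{k,0} &:= \operatorname{Var}_{f,\Sigma_0}[\operatorname{vec}\tilde\Theta_k] = \mathbf{D}_k(J_k \otimes \Sigma_0)\mathbf{D}_k^\top, \tag{4.18}
\end{align*}

where the matrix $J_k$ is a $k \times k$ matrix with all its elements equal to 1 and the
$pk \times nk$ matrix $\mathbf{D}_k$ is defined as follows:

\[ \mathbf{D}_k := \operatorname{diag}(D_1, \ldots, D_k), \qquad D_l := B_l^{-1}\Psi W_l, \quad l = 1, \ldots, k. \tag{4.19} \]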

By Lemma 2.6.2 from Section 2.6, under Assumption ($\mathfrak{S}$) with the same $\delta$, a relation
similar to ($\mathfrak{S}$) holds for the covariance matrices $\boldsymbol{\Sigma}_k$ and $\boldsymbol{\Sigma}_{k,0}$ of the linear estimators:

\[ (1 - \delta)\boldsymbol{\Sigma}_k \preceq \boldsymbol{\Sigma}_{k,0} \preceq (1 + \delta)\boldsymbol{\Sigma}_k, \qquad k \le K. \tag{4.20} \]

Even though the moment generating function of $\operatorname{vec}\tilde\Theta_k$ has a form corresponding
to the multivariate normal distribution (see Lemma 2.6.4 in Section 2.6), this
representation makes sense only if $\boldsymbol{\Sigma}_k$ is nonsingular. Notice that $\operatorname{rank}(J_k \otimes \Sigma) = n$.
From $\Sigma \succ 0$ it follows only that $\boldsymbol{\Sigma}_k \succeq 0$; similarly, $\boldsymbol{\Sigma}_{k,0} \succeq 0$. However, without
any additional assumptions it is easy to show (see Lemma 2.6.3 in Section 2.6)
that for rectangular kernels $\boldsymbol{\Sigma}_k \succ 0$. On the other hand, due to (4.20), it is enough
to require nonsingularity only for the matrix $\boldsymbol{\Sigma}_k$ corresponding to the approximate
model (1.2), and its choice belongs to the statistician. In what follows we assume that
$\boldsymbol{\Sigma}_k \succ 0$.

Denote by $P^k_{\theta,\Sigma} = \mathcal{N}(\operatorname{vec}\Theta_k, \boldsymbol{\Sigma}_k)$ and by $P^k_{f,\Sigma_0} = \mathcal{N}(\operatorname{vec}\Theta_k^*, \boldsymbol{\Sigma}_{k,0})$, $k =
1, \ldots, K$, the distributions of $\operatorname{vec}\tilde\Theta_k$ under the null and under the alternative. Denote
also the Radon-Nikodym derivative by

\[ Z_k := \frac{dP^k_{f,\Sigma_0}}{dP^k_{\theta,\Sigma}}. \tag{4.21} \]

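Then by Lemma 2.6.5 from Section 2.6 the Kullback-Leibler divergence between these
measures has the following form:

\begin{align*}
2\,\mathrm{KL}(P^k_{f,\Sigma_0}, P^k_{\theta,\Sigma}) &= 2\,\mathbb{E}_{f,\Sigma_0}\log\Big(\frac{dP^k_{f,\Sigma_0}}{dP^k_{\theta,\Sigma}}\Big) \\
 &= \Delta(k) + \log\Big(\frac{\det\boldsymbol{\Sigma}_k}{\det\boldsymbol{\Sigma}_{k,0}}\Big) + \operatorname{tr}(\boldsymbol{\Sigma}_k^{-1}\boldsymbol{\Sigma}_{k,0}) - pk, \tag{4.22}
\end{align*}

where

\begin{align*}
b(k) &:= \operatorname{vec}\Theta_k^* - \operatorname{vec}\Theta_k, \tag{4.23} \\
\Delta(k) &:= b(k)^\top\boldsymbol{\Sigma}_k^{-1}b(k). \tag{4.24}
\end{align*}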
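
The structure of this divergence is transparent in the scalar case $pk = 1$, where it reads $2\,\mathrm{KL} = b^2/\sigma^2 + \log(\sigma^2/\sigma_0^2) + \sigma_0^2/\sigma^2 - 1$. The following sketch evaluates this one-dimensional instance of the standard Gaussian formula; the function name `two_kl_gauss_1d` is hypothetical and the numbers are toy values.

```python
from math import log


def two_kl_gauss_1d(mean_true, var_true, mean_model, var_model):
    """2 * KL( N(mean_true, var_true) || N(mean_model, var_model) ) in dimension one:
    bias term + log-determinant term + trace term - dimension."""
    delta = (mean_true - mean_model) ** 2 / var_model   # scalar analogue of Delta(k)
    return delta + log(var_model / var_true) + var_true / var_model - 1.0


no_misspecification = two_kl_gauss_1d(1.0, 2.0, 0.0, 2.0)   # only the bias term remains
noise_only = two_kl_gauss_1d(0.0, 1.2, 0.0, 1.0)            # only the variance terms remain
```

With equal variances the divergence reduces to the modeling bias term alone, and with equal means it measures only the noise misspecification.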

If there were no "noise misspecification", i.e., if $\delta = 0$ implying $\Sigma = \Sigma_0$, then
$\Delta(k) = b(k)^\top\Sigma_k^{-1}b(k) = 2\,\mathrm{KL}(\mathbb{P}^k_{f,\Sigma}, \mathbb{P}^k_{\theta,\Sigma})$. Therefore this quantity can be used to
indicate the deviation between the mean values in the true (1.1) and the approximate
(1.2) models. Clearly, under (W) the quantity $\Delta(k)$ grows with $k$, so following the
terminology suggested in [66], we introduce the small modeling bias condition:

(SMB) Let there exist for some $k \le K$ and some $\theta$ a constant $\Delta \ge 0$ such that

$$\Delta(k) \le \Delta.$$

Monotonicity of $\Delta(k)$ and Assumption (SMB) immediately imply that

$$\Delta(k') \le \Delta \quad \text{for all } k' \le k.$$

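The conditions (4.20) yield $-pk\delta \le \operatorname{tr}(\Sigma_k^{-1}\Sigma_{k,0}) - pk \le pk\delta$. Thus (6.7) implies the
bound for the Kullback-Leibler divergence in terms of $\delta$:

$$-\frac{pk}{2}\log(1+\delta) + \frac{\Delta(k)}{2} - \frac{pk\delta}{2} \le \mathrm{KL}(\mathbb{P}^k_{f,\Sigma_0}, \mathbb{P}^k_{\theta,\Sigma}) \le -\frac{pk}{2}\log(1-\delta) + \frac{\Delta(k)}{2} + \frac{pk\delta}{2}. \qquad (4.25)$$

Moreover, as $\delta \to 0+$,

$$\Delta(k) - 2pk\delta + o(\delta) \le 2\,\mathrm{KL}(\mathbb{P}^k_{f,\Sigma_0}, \mathbb{P}^k_{\theta,\Sigma}) \le \Delta(k) + 2pk\delta + o(\delta). \qquad (4.26)$$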
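As a sanity check, the closed form (4.22) and the two-sided bound (4.25) can be evaluated numerically. This is a sketch only: the function names are hypothetical, `cov_null` plays the role of $\Sigma_k$, `cov_alt` the role of $\Sigma_{k,0}$, and `d` stands for $pk$.

```python
import numpy as np

def two_kl(b, cov_null, cov_alt):
    """2*KL(N(m + b, cov_alt) || N(m, cov_null)), the right-hand side of (4.22)."""
    d = len(b)
    modeling_bias = b @ np.linalg.solve(cov_null, b)          # Delta(k)
    logdet_null = np.linalg.slogdet(cov_null)[1]
    logdet_alt = np.linalg.slogdet(cov_alt)[1]
    trace_term = np.trace(np.linalg.solve(cov_null, cov_alt))
    return modeling_bias + (logdet_null - logdet_alt) + trace_term - d

def kl_bounds(modeling_bias, d, delta):
    """Lower and upper bounds of (4.25) for KL itself (not 2*KL)."""
    lower = -0.5 * d * np.log1p(delta) + 0.5 * modeling_bias - 0.5 * d * delta
    upper = -0.5 * d * np.log1p(-delta) + 0.5 * modeling_bias + 0.5 * d * delta
    return lower, upper
```

With `cov_alt == cov_null` the function `two_kl` returns exactly the modeling bias $\Delta(k)$, which matches the case $\delta = 0$ discussed above.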

This means that if for some k Assumption (SMB) is fulfilled and 6 = , then 
the Kullback-Leibler divergence between $\mathbb{P}^k_{\theta,\Sigma}$ and $\mathbb{P}^k_{f,\Sigma_0}$ is bounded by a small
constant.

Now one can state the crucial property for obtaining the final oracle result. 

Theorem 2.4.8. (Propagation property) 

Assume (D), (Loc), (S), (W), (B) and (PC). Then for any $k \le K$ the
following upper bounds hold:



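where

$$\varphi(\delta) \stackrel{\mathrm{def}}{=} \begin{cases} 1 & \text{for homogeneous errors,} \\[4pt] \dfrac{2(1+\delta)}{(1-\delta)^2} - 1 & \text{otherwise.} \end{cases}$$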



Here $\tilde\theta_k = \tilde\theta_k(x)$ is the QMLE defined by (2.3) and $\hat\theta_k(x) = \tilde\theta_{\min\{k,\hat k\}}(x)$ is the
adaptive estimator at the $k$th step of the procedure.

Remark 2.4.3. Bounds (4.30) and (4.29) below give a kind of condition on the
relative error in the noise misspecification. As $\delta \to 0+$ it holds for every $k \le K$

'fi^^^rrl - + o(5) < logEedZl] < viS)^^^ + 2pk6 + o{6), 

i + 1 — 

where $Z_k$ is defined by (4.21).

This bound implies, up to the additive constant $\log(\alpha\,\mathbb{E}|\chi^2_p|^r)/2$, the same asymptotic
behavior for the logarithm of the risk of the adaptive estimator at each step of the
procedure. Because by (SMB) the quantity $\Delta(k)$ is bounded by a small constant
and $K$ is of order $\log n$, $\mathbb{E}_{\theta,\Sigma}[Z_k^2]$ is small if $\delta = o(1/\log n)$. This means that in
the case when $\Sigma$ is an estimator for $\Sigma_0$, an accuracy that is only logarithmic in the
sample size is needed. This observation is of particular importance, since it is known from [64] that
the rate $n^{-1/2}$ of variance estimation is achievable only for dimensions $d \le 8$ over
classes of functions with bounded second derivative.

Remark 2.4.4. The propagation property guarantees that, with high probability, the adaptive
procedure does not stop while $\Delta(k)$ is small, i.e. under (SMB), and if
the relative error $\delta$ in the noise is sufficiently small.

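Proof. Notice that for any nonnegative measurable function $g = g(\tilde\Theta_k)$ the
Cauchy-Schwarz inequality implies

$$\mathbb{E}_{f,\Sigma_0}[g] = \mathbb{E}_{\theta,\Sigma}[g Z_k] \le (\mathbb{E}_{\theta,\Sigma}[g^2])^{1/2}(\mathbb{E}_{\theta,\Sigma}[Z_k^2])^{1/2} \qquad (4.27)$$

with the Radon-Nikodym derivative $Z_k$ defined by (4.21).

Taking $g = |(\tilde\theta_k - \theta)^\top B_k(\tilde\theta_k - \theta)|^{r/2}$ one gets the first assertion applying "the
parametric risk bound" with $\delta = 0$ from (4.7):

$$\mathbb{E}_{f,\Sigma_0}|(\tilde\theta_k - \theta)^\top B_k(\tilde\theta_k - \theta)|^{r/2} \le (\mathbb{E}_{\theta,\Sigma}|(\tilde\theta_k - \theta)^\top B_k(\tilde\theta_k - \theta)|^{r})^{1/2}(\mathbb{E}_{\theta,\Sigma}[Z_k^2])^{1/2}$$
$$= (\mathbb{E}_{\theta,\Sigma}|2L(W_k, \tilde\theta_k, \theta)|^{r})^{1/2}(\mathbb{E}_{\theta,\Sigma}[Z_k^2])^{1/2}$$
$$\le (\mathbb{E}|\chi^2_p|^{r})^{1/2}(\mathbb{E}_{\theta,\Sigma}[Z_k^2])^{1/2}.$$

The second assertion is treated similarly by applying the pivotality property (Lemma 2.6.1)
and the propagation conditions (3.9).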

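To calculate $\mathbb{E}_{\theta,\Sigma}[Z_k^2]$ let us consider $\log Z_k$ given by

$$\log Z_k(y) = \frac12\log\frac{\det\Sigma_k}{\det\Sigma_{k,0}} - \frac12\|\Sigma_{k,0}^{-1/2}(y - \operatorname{vec}\Theta^*_k)\|^2 + \frac12\|\Sigma_k^{-1/2}(y - \operatorname{vec}\Theta_k)\|^2$$

as a function of $\operatorname{vec}\Theta^*_k$. Application of the Taylor expansion at the point $\operatorname{vec}\Theta_k$
yields

$$2\log Z_k = \log\frac{\det\Sigma_k}{\det\Sigma_{k,0}} - \|\Sigma_{k,0}^{-1/2}(y - \operatorname{vec}\Theta_k)\|^2 + \|\Sigma_k^{-1/2}(y - \operatorname{vec}\Theta_k)\|^2$$
$$\qquad + 2b(k)^\top\Sigma_{k,0}^{-1}(y - \operatorname{vec}\Theta_k) - b(k)^\top\Sigma_{k,0}^{-1}b(k).$$

With $\xi \sim \mathcal{N}(0, I_{pk})$ the second moment of the Radon-Nikodym derivative under the
null hypothesis reads as follows:

$$\mathbb{E}_{\theta,\Sigma}[Z_k^2] = \frac{\det\Sigma_k}{\det\Sigma_{k,0}}\,\mathbb{E}\exp\{-\|\Sigma_{k,0}^{-1/2}(\Sigma_k^{1/2}\xi - b(k))\|^2 + \|\xi\|^2\}$$
$$= \frac{\det\Sigma_k}{\det\Sigma_{k,0}}\,[\det(2\Sigma_k^{1/2}\Sigma_{k,0}^{-1}\Sigma_k^{1/2} - I_{pk})]^{-1/2}$$
$$\quad\times \exp\{2b(k)^\top\Sigma_{k,0}^{-1}\Sigma_k^{1/2}(2\Sigma_k^{1/2}\Sigma_{k,0}^{-1}\Sigma_k^{1/2} - I_{pk})^{-1}\Sigma_k^{1/2}\Sigma_{k,0}^{-1}b(k) - b(k)^\top\Sigma_{k,0}^{-1}b(k)\}, \qquad (4.28)$$

where $\det(2\Sigma_k^{1/2}\Sigma_{k,0}^{-1}\Sigma_k^{1/2} - I_{pk}) = \prod_{j=1}^{pk}\{2\lambda_j(\Sigma_k^{1/2}\Sigma_{k,0}^{-1}\Sigma_k^{1/2}) - 1\}$.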
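The closed form (4.28) is easy to evaluate numerically. The sketch below is an illustration under the stated Gaussian set-up; the function name is hypothetical, and a Cholesky factor is used in place of the symmetric square root $\Sigma_k^{1/2}$, which gives the same value because the two factors differ by an orthogonal matrix.

```python
import numpy as np

def second_moment_rn(b, cov_null, cov_alt):
    """E_null[Z^2] for Z = dN(m + b, cov_alt) / dN(m, cov_null), cf. (4.28)."""
    d = len(b)
    L = np.linalg.cholesky(cov_null)              # cov_null = L L^T
    A = L.T @ np.linalg.solve(cov_alt, L)         # same spectrum as Sigma_k^{1/2} Sigma_{k,0}^{-1} Sigma_k^{1/2}
    M = 2.0 * A - np.eye(d)
    if np.linalg.eigvalsh((M + M.T) / 2).min() <= 0:
        raise ValueError("second moment is infinite: 2A - I is not positive definite")
    u = np.linalg.solve(cov_alt, b)               # Sigma_{k,0}^{-1} b
    v = L.T @ u
    exponent = 2.0 * v @ np.linalg.solve(M, v) - b @ u
    log_det_ratio = np.linalg.slogdet(cov_null)[1] - np.linalg.slogdet(cov_alt)[1]
    return np.exp(log_det_ratio - 0.5 * np.linalg.slogdet(M)[1] + exponent)
```

For `cov_alt == cov_null` the value reduces to $\exp\{\Delta(k)\}$, the familiar second moment of the likelihood ratio between two shifted Gaussian measures with a common covariance.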



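To estimate the obtained expression in terms of the level of noise misspecification $\delta$
notice that the condition (4.20) implies

$$\Big(\frac{1}{1+\delta}\Big)^{pk} \le \frac{\det\Sigma_k}{\det\Sigma_{k,0}} \le \Big(\frac{1}{1-\delta}\Big)^{pk},$$

$$\Big(\frac{1-\delta}{1+\delta}\Big)^{pk/2} \le \Big[\prod_{j=1}^{pk}\{2\lambda_j(\Sigma_k^{1/2}\Sigma_{k,0}^{-1}\Sigma_k^{1/2}) - 1\}\Big]^{-1/2} \le \Big(\frac{1+\delta}{1-\delta}\Big)^{pk/2}.$$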



Therefore the quantity in the exponent in (4.28) is bounded by:



< 2
Moreover,



1+6 1+S ^ ^ k \ J 



Finally, 



(i+sr) """^ + s)' ji+sj 

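In the case of homogeneous errors the expression for $\log Z_k$ reads as

$$\log Z_k = pk\log\Big(\frac{\sigma}{\sigma_0}\Big) + \frac12\Big(\frac{1}{\sigma^2} - \frac{1}{\sigma_0^2}\Big)\|V_k^{-1/2}(y - \operatorname{vec}\Theta_k)\|^2$$
$$\qquad + \frac{1}{\sigma_0^2}\,b(k)^\top V_k^{-1}(y - \operatorname{vec}\Theta_k) - \frac{1}{2\sigma_0^2}\,b(k)^\top V_k^{-1}b(k),$$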

implying 

By the condition (S)



pk 

5 \~ 



(1 + 5) 



exp 




pk 

<^eAZl] < [-^^)^ ^-Viz^^l (4-30) 

where p is the dimension of the parameter set and k is the degree of the localization. 

□

2.4.4 Quality of estimation in the nonparametric case: the
oracle result

Define the oracle index as the largest index $k \le K$ such that the small modeling bias
condition (SMB) holds, that is

$$k^* \stackrel{\mathrm{def}}{=} \max\{k \le K : \Delta(k) \le \Delta\}. \qquad (4.31)$$
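The definition (4.31) translates directly into a lookup over the sequence $\Delta(1), \ldots, \Delta(K)$. The helper below is a hypothetical illustration; it does not rely on monotonicity of $\Delta(k)$ and returns `None` when no index satisfies the condition.

```python
def oracle_index(modeling_bias, threshold):
    """Largest 1-based index k with modeling_bias[k-1] <= threshold, as in (4.31).

    modeling_bias : sequence Delta(1), ..., Delta(K)
    threshold     : the constant Delta from the (SMB) condition
    """
    admissible = [k for k, value in enumerate(modeling_bias, start=1)
                  if value <= threshold]
    return max(admissible) if admissible else None
```

In practice $\Delta(k)$ depends on the unknown regression function, so the oracle index serves as a benchmark for the adaptive procedure rather than as a computable quantity.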

Theorem 2.4.9. Let $\Delta(1) \le \Delta$, i.e., the first estimator is always accepted by the
testing procedure. Let $k^*$ be the oracle index. Then under the conditions (D), (Loc),
(S), (W), (B) the risk between the adaptive estimator and the oracle is bounded by
the following expression:

E\{ek*-dyBk,{ek*-d)\'/^ (4.32) 



where $\varphi(\delta)$ is as in Theorem 2.4.8.



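Proof. By the definition of the adaptive estimator $\hat\theta = \tilde\theta_{\hat k}$. Because the events
$\{\hat k \le k^*\}$ and $\{\hat k > k^*\}$ are disjoint one can write

$$\mathbb{E}|(\tilde\theta_{k^*} - \hat\theta)^\top B_{k^*}(\tilde\theta_{k^*} - \hat\theta)|^{r/2} = \mathbb{E}|(\tilde\theta_{k^*} - \hat\theta_{k^*})^\top B_{k^*}(\tilde\theta_{k^*} - \hat\theta_{k^*})|^{r/2}\,\mathbb{I}\{\hat k \le k^*\}$$
$$\qquad + \mathbb{E}|(\tilde\theta_{k^*} - \tilde\theta_{\hat k})^\top B_{k^*}(\tilde\theta_{k^*} - \tilde\theta_{\hat k})|^{r/2}\,\mathbb{I}\{\hat k > k^*\}.$$

If $\hat k \le k^*$ then $\hat\theta_{k^*} = \tilde\theta_{\min\{k^*,\hat k\}} = \tilde\theta_{\hat k} = \hat\theta$. Thus to bound the first summand it is enough
to apply Theorem 2.4.8 with $k = k^*$.

To bound the second expectation, i.e. to bound fluctuations of the adaptive
estimator at the steps of the procedure for which the SMB condition is not fulfilled
anymore, just notice that for $\hat k > k^*$ the quadratic form coincides with the test
statistic $T_{k^*,\hat k}$:

$$(\tilde\theta_{k^*} - \tilde\theta_{\hat k})^\top B_{k^*}(\tilde\theta_{k^*} - \tilde\theta_{\hat k}) = T_{k^*,\hat k}.$$

But the index $\hat k$ was accepted; this means that $T_{l,\hat k} \le \mathfrak{z}_l$ for all $l < \hat k$ and therefore
for $l = k^*$. Thus

$$\mathbb{E}|(\tilde\theta_{k^*} - \hat\theta)^\top B_{k^*}(\tilde\theta_{k^*} - \hat\theta)|^{r/2}\,\mathbb{I}\{\hat k > k^*\} \le \mathfrak{z}_{k^*}^{r/2}.$$

□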

2.4.5 Oracle risk bounds for estimators of the regression func- 
tion and its derivatives 

Theorem 2.4.9 provides an oracle risk bound for the adaptive estimator $\hat\theta(x) = \tilde\theta_{\hat k}(x)$
of the parameter vector $\theta(x) \in \mathbb{R}^p$ of the finite-rank expansion from the method of
local approximation, see Section 1.2.3 for details. This is equivalent to the estimation
of the parameter of the local linear parametric fit at the point $x$ to the
model (1.1) under misspecification together with the adaptive choice of the degree of
localization (of the bandwidth). If the basis is polynomial and the regression function
$f(\cdot)$ is sufficiently smooth in a neighborhood of $x$, then $\hat\theta(x)$ is the adaptive local
polynomial estimator $LP^{ad}(p-1)$ of the vector $(f^{(0)}(x), \ldots, f^{(p-1)}(x))^\top$ of the values
of $f$ and its derivatives (if they exist) at the reference point $x \in \mathbb{R}^d$ under the model
misspecification.

Now we are going to obtain a similar oracle result for the components of the
vector $\hat\theta(x)$, particularly for $e_j^\top\hat\theta(x)$, $j = 1, \ldots, p$, where $e_j = (0, \ldots, 1, \ldots, 0)^\top$ is
the $j$th canonical basis vector in $\mathbb{R}^p$. As a corollary of this general result in the case
of the polynomial basis we get an oracle risk bound for $LP^{ad}(p-1)$ estimators of
the function $f$ and its derivatives at the point $x$.

Denote the $LP_k(p-1)$ estimator of $f^{(j-1)}(x)$ corresponding to the $k$th scale by

$$\tilde f_k^{(j-1)}(x) \stackrel{\mathrm{def}}{=} e_j^\top\tilde\theta_k(x), \qquad j = 1, \ldots, p, \qquad (4.33)$$
$$\tilde f_k(x) \stackrel{\mathrm{def}}{=} \tilde f_k^{(0)}(x) = e_1^\top\tilde\theta_k(x).$$

Then the adaptive local polynomial estimators are defined as follows:

$$\hat f^{(j-1)}(x) \stackrel{\mathrm{def}}{=} e_j^\top\hat\theta(x), \qquad j = 1, \ldots, p, \qquad (4.34)$$
$$\hat f(x) \stackrel{\mathrm{def}}{=} e_1^\top\hat\theta(x).$$

Similarly, the adaptive estimators of the function $f$ and its derivatives corresponding
to the $k$th step of the procedure are given by

$$\hat f_k^{(j-1)}(x) \stackrel{\mathrm{def}}{=} e_j^\top\hat\theta_k(x), \qquad j = 1, \ldots, p. \qquad (4.35)$$

Thus, if the basis is polynomial, the estimator $\hat f(x) = \hat f^{(0)}(x)$ is the $LP^{ad}(p-1)$
estimator of the value $f(x)$, and $\hat f^{(j-1)}(x)$ with $j = 2, \ldots, p$ are, correspondingly,
the $LP^{ad}(p-1)$ estimators of the values of its derivatives. We will use the polynomial
basis to obtain the rate of convergence, but it should be stressed that the results of
Theorems 2.4.9 and 2.4.15 hold for any basis satisfying the conditions of the theorems.
We need the following assumptions:

(S1) There exist $0 < \sigma_{\min} \le \sigma_{\max} < \infty$ such that for any $i = 1, \ldots, n$ the variance
of the errors in the "approximate" model (1.2) is uniformly bounded:

$$\sigma^2_{\min} \le \sigma_i^2 \le \sigma^2_{\max}.$$

(Lp1) Let assumption (S1) be satisfied. There exists a number $\lambda_0 > 0$ such that
for any $k = 1, \ldots, K$ the smallest eigenvalue fulfills $\lambda_p(B_k) \ge n h_k \lambda_0 \sigma_{\max}^{-2}$ for
$n$ sufficiently large.

Then, because $B_k \succ 0$, for any $k = 1, \ldots, K$ we have

$$\gamma^\top B_k^{-1}\gamma \le \frac{\sigma^2_{\max}}{n h_k \lambda_0}\,\|\gamma\|^2 \qquad (4.36)$$

for any $\gamma \in \mathbb{R}^p$, and we obtain the following lemma:

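Lemma 2.4.10. Let (S1) and (Lp1) be satisfied. Then for any $j = 1, \ldots, p$ and
$k, k' = 1, \ldots, K$ the following upper bound holds:

$$\Big(\frac{n h_k \lambda_0}{\sigma^2_{\max}}\Big)^{1/2}|e_j^\top\tilde\theta_k - e_j^\top\tilde\theta_{k'}| \le \|B_k^{1/2}(\tilde\theta_k - \tilde\theta_{k'})\|.$$

Proof. By (4.36) taking $\gamma = B_k^{1/2}(\tilde\theta_k - \tilde\theta_{k'})$ we have

$$|e_j^\top(\tilde\theta_k - \tilde\theta_{k'})|^2 \le \|\tilde\theta_k - \tilde\theta_{k'}\|^2 = \gamma^\top B_k^{-1}\gamma \le \frac{\sigma^2_{\max}}{n h_k \lambda_0}\,\|B_k^{1/2}(\tilde\theta_k - \tilde\theta_{k'})\|^2.$$

□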



To obtain the "componentwise" oracle risk bounds we need to recheck the "propagation
property". First, notice that the "propagation conditions" (3.9) on the choice of
the critical values $\mathfrak{z}_1, \ldots, \mathfrak{z}_{K-1}$ imply similar bounds for the components $e_j^\top\tilde\theta_k(x)$.
Recall that $\hat\theta_k = \tilde\theta_{\min\{k,\hat k\}}$. Then, by (3.9), Lemma 2.4.10 and the pivotality property
(Lemma 2.6.1) we have the following simple observation:

Lemma 2.4.11. Let (S1) and (Lp1) be satisfied. Under the propagation conditions
(PC) for any $\theta \in \mathbb{R}^p$ and all $k = 2, \ldots, K$ we have:

< aC{p,r). 

Here $\mathbb{E}_{0,\Sigma}$ stands for the expectation w.r.t. the measure $\mathcal{N}(0, \Sigma)$ and $C(p,r) =$

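As in the first parts of this chapter, to make the notation shorter we will suppress
the dependence on $x$. To get the propagation property we study for $k = 1, \ldots, K$
the joint distributions of $e_j^\top\tilde\theta_1, \ldots, e_j^\top\tilde\theta_k$, that is the distribution of $e_j^\top\tilde\Theta_k$, the $j$th
row of the matrix $\tilde\Theta_k$, under the null and under the alternative. Obviously,

$$\mathbb{E}_{f,\Sigma_0}[e_j^\top\tilde\Theta_k] = e_j^\top\Theta^*_k = (e_j^\top\theta^*_1, \ldots, e_j^\top\theta^*_k),$$

and the mean under the null (the true parameter in the parametric set-up) is:

$$\mathbb{E}_{\theta,\Sigma}[e_j^\top\tilde\Theta_k] = e_j^\top\Theta_k = (e_j^\top\theta, \ldots, e_j^\top\theta).$$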

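Recall that the matrices $\Sigma_{k,0}$ and $\Sigma_k$ have a block structure. Now, for instance,
to study the estimator of the first coordinate of the "best parametric fit" vector (or
of $f(x)$ in the case of the polynomial basis) we take the first elements of each block
and so on. Denote the $k \times k$ covariance matrices of the $j$th elements of the vectors
$\tilde\theta_1, \ldots, \tilde\theta_k$ by

$$\Sigma_{k,j} \stackrel{\mathrm{def}}{=} \mathbf{D}_{k,j}(J_k \otimes \Sigma)\mathbf{D}_{k,j}^\top \quad \text{under the null}, \qquad (4.37)$$

$$\Sigma_{k,0,j} \stackrel{\mathrm{def}}{=} \mathbf{D}_{k,j}(J_k \otimes \Sigma_0)\mathbf{D}_{k,j}^\top \quad \text{under the alternative}, \qquad (4.38)$$

where $J_k$ is a $k \times k$ matrix with all its elements equal to 1, and the $k \times nk$ block
diagonal matrices $\mathbf{D}_{k,j}$ are defined by

$$\mathbf{D}_{k,j} \stackrel{\mathrm{def}}{=} e_j^\top D_1 \oplus \cdots \oplus e_j^\top D_k = (I_k \otimes e_j^\top)\mathbf{D}_k, \qquad D_l = B_l^{-1}\Psi W_l, \quad l = 1, \ldots, k. \qquad (4.39)$$

Moreover, the following representation holds:

$$\Sigma_{k,j} = (I_k \otimes e_j)^\top\Sigma_k(I_k \otimes e_j), \qquad (4.40)$$

where $\Sigma_k$ is defined by (4.17). Similarly,

$$\Sigma_{k,0,j} = (I_k \otimes e_j)^\top\Sigma_{k,0}(I_k \otimes e_j). \qquad (4.41)$$

Thus, the important relation (4.20) is preserved for $\Sigma_{k,j}$ and $\Sigma_{k,0,j}$ obtained by
picking the $(j,j)$th elements of each block of $\Sigma_k$ and $\Sigma_{k,0}$ respectively.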
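The representation (4.40) says that $\Sigma_{k,j}$ is obtained from $\Sigma_k$ by keeping, in every $p \times p$ block, only the $(j,j)$th entry. A short sketch (the function name is illustrative) makes the selection explicit:

```python
import numpy as np

def component_covariance(Sigma_k, p, j):
    """Sigma_{k,j} = (I_k kron e_j)^T Sigma_k (I_k kron e_j); j is 1-based."""
    k = Sigma_k.shape[0] // p
    e_j = np.zeros((p, 1))
    e_j[j - 1, 0] = 1.0
    selector = np.kron(np.eye(k), e_j)       # shape (p*k, k)
    return selector.T @ Sigma_k @ selector
```

The result coincides with the strided slice `Sigma_k[j-1::p, j-1::p]`. Because the map $M \mapsto S^\top M S$ preserves the Loewner order, the two-sided relation (4.20) carries over to the compressed matrices.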
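With the usual notation $\gamma^{(j)}$ for the $j$th component of $\gamma \in \mathbb{R}^p$, denote by

$$b_j(k) \stackrel{\mathrm{def}}{=} (e_j^\top(\theta^*_1 - \theta), \ldots, e_j^\top(\theta^*_k - \theta))^\top = ((\theta^*_1 - \theta)^{(j)}, \ldots, (\theta^*_k - \theta)^{(j)})^\top \in \mathbb{R}^k, \qquad (4.42)$$

$$\Delta_j(k) \stackrel{\mathrm{def}}{=} b_j(k)^\top\Sigma_{k,j}^{-1}b_j(k). \qquad (4.43)$$

Theorem 2.4.12. ("Componentwise" propagation property)

Under the conditions (D), (Loc), (S), (S1), (PC), (B), (W) and (Lp1)
for any $k \le K$ the following upper bound holds: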



a2 

max 



E\e; Okix) - 0k{x) 
where

$$\varphi(\delta) \stackrel{\mathrm{def}}{=} \begin{cases} 1 & \text{for homogeneous errors,} \\[4pt] \dfrac{2(1+\delta)}{(1-\delta)^2} - 1 & \text{otherwise.} \end{cases}$$



Corollary 2.4.13. Let the basis be polynomial. Then under the conditions of the 
preceding theorem the following upper bound holds: 

V ^max J 

with $\varphi(\delta)$ as before.

Proof. The proof essentially follows the line of the proof of Theorem 2.4.8. If the
distributions of $\operatorname{vec}\tilde\Theta_k$ under the null and under the alternative are Gaussian, then
any subvector is also Gaussian. Denote by $\mathbb{P}^{k,j}_{\theta,\Sigma} = \mathcal{N}((e_j^\top\theta, \ldots, e_j^\top\theta)^\top, \Sigma_{k,j})$ and

by $\mathbb{P}^{k,j}_{f,\Sigma_0} = \mathcal{N}((e_j^\top\theta^*_1, \ldots, e_j^\top\theta^*_k)^\top, \Sigma_{k,0,j})$, $k = 1, \ldots, K$, the distributions of $e_j^\top\tilde\Theta_k$
under the null and under the alternative.

By the Cauchy-Schwarz inequality and Lemma 2.4.11

r/2 



\ ^max J 



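with the Radon-Nikodym derivative given by

$$Z_{k,j} \stackrel{\mathrm{def}}{=} \frac{d\mathbb{P}^{k,j}_{f,\Sigma_0}}{d\mathbb{P}^{k,j}_{\theta,\Sigma}}. \qquad (4.44)$$

By relations (4.40) and (4.41) the analog of Assumption (S) is preserved for $\Sigma_{k,0,j}$
and $\Sigma_{k,j}$, that is, there exists $\delta \in [0, 1)$ such that

$$(1 - \delta)\Sigma_{k,j} \preceq \Sigma_{k,0,j} \preceq (1 + \delta)\Sigma_{k,j} \qquad (4.45)$$

for any $k \le K$ and $j = 1, \ldots, p$. By the Taylor expansion at the point $(e_j^\top\theta, \ldots, e_j^\top\theta)^\top$,

with $\xi \sim \mathcal{N}(0, I_k)$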



^g^exp{-||S-;/;6,(^)f} 



X E 



k 



-1/2 



/=1 



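Now utilizing (4.45) we get

$$(2\Sigma_{k,j}^{1/2}\Sigma_{k,0,j}^{-1}\Sigma_{k,j}^{1/2} - I_k)^{-1} \preceq \frac{1+\delta}{1-\delta}\,I_k,$$

$$2\,\Sigma_{k,0,j}^{-1}\Sigma_{k,j}^{1/2}(2\Sigma_{k,j}^{1/2}\Sigma_{k,0,j}^{-1}\Sigma_{k,j}^{1/2} - I_k)^{-1}\Sigma_{k,j}^{1/2}\Sigma_{k,0,j}^{-1} \preceq 2\,\frac{1+\delta}{1-\delta}\,\Sigma_{k,0,j}^{-1}\Sigma_{k,j}\Sigma_{k,0,j}^{-1} \preceq \frac{2(1+\delta)}{(1-\delta)^2}\,\Sigma_{k,0,j}^{-1},$$

$$\frac{\det\Sigma_{k,j}}{\det\Sigma_{k,0,j}} \le \Big(\frac{1}{1-\delta}\Big)^{k},$$

$$\Big[\prod_{l=1}^{k}\{2\lambda_l(\Sigma_{k,j}^{1/2}\Sigma_{k,0,j}^{-1}\Sigma_{k,j}^{1/2}) - 1\}\Big]^{-1/2} \le \Big(\frac{1+\delta}{1-\delta}\Big)^{k/2}.$$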



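Finally, because $b_j(k)^\top\Sigma_{k,0,j}^{-1}b_j(k) \le \Delta_j(k)(1-\delta)^{-1}$, we obtain the bound for the
second moment of the Radon-Nikodym derivative:

$$\mathbb{E}_{\theta,\Sigma}[Z_{k,j}^2] \le \Big(\frac{1+\delta}{(1-\delta)^3}\Big)^{k/2}\exp\Big\{\varphi(\delta)\,\frac{\Delta_j(k)}{1-\delta}\Big\},$$

which completes the proof.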



□

At this point we introduce the following "componentwise" small modeling bias
conditions:

(SMBj) Let there exist for some $j = 1, \ldots, p$, some $k(j) \le K$ and some $\theta^{(j)} =
e_j^\top\theta$ a constant $\Delta_j \ge 0$ such that

$$\Delta_j(k(j)) \le \Delta_j, \qquad (4.46)$$

where $\Delta_j(k)$ is defined by (4.43).

Definition 2.4.14. For each $j = 1, \ldots, p$ the oracle index $k^*(j)$ is defined as the
largest index in the scale for which the (SMBj) condition holds, that is,

$$k^*(j) \stackrel{\mathrm{def}}{=} \max\{k \le K : \Delta_j(k) \le \Delta_j\}. \qquad (4.47)$$

Theorem 2.4.15. Let the smallest bandwidth $h_1$ be such that the first estimator
$e_j^\top\tilde\theta_1(x)$ is always accepted in the adaptive procedure. Let $k^*(j)$ be the oracle index
defined by (4.47), $j = 1, \ldots, p$. Assume (D), (Loc), (S), (B), (PC), (W),

(S1) and (Lp1). Then the risk between the $j$th coordinates of the adaptive estimator
and the oracle is bounded by the following expression:



E|e 6lfc.(,-)(x)-e 6>(x)r (4.48) 



^2 I -^fc*(i)v-/ 



max 



where $\varphi(\delta)$ is as in Theorem 2.4.12.



Corollary 2.4.16. Let the basis be polynomial. Under the conditions of the preceding
theorem, the risk between the adaptive estimator $LP^{ad}(p-1)$ of the value of the $j$th
derivative of $f$ at $x$ and the oracle is bounded by the following expression:

\ max J 

with $\varphi(\delta)$ as before.

Proof. To simplify the notation we suppress the dependence on $j$ in the index $k$.
Similarly to the proof of Theorem 2.4.9 we consider the disjoint events $\{\hat k \le k^*\}$ and
$\{\hat k > k^*\}$. Therefore,

$$\mathbb{E}|e_j^\top\tilde\theta_{k^*}(x) - e_j^\top\hat\theta(x)|^r = \mathbb{E}|e_j^\top\tilde\theta_{k^*}(x) - e_j^\top\hat\theta(x)|^r\,\mathbb{I}\{\hat k \le k^*\}$$
$$\qquad + \mathbb{E}|e_j^\top\tilde\theta_{k^*}(x) - e_j^\top\hat\theta(x)|^r\,\mathbb{I}\{\hat k > k^*\}.$$

By Lemma 2.4.10 and the definition of the test statistic $T_{k^*,\hat k}$ the second summand
can be easily bounded:

$$\Big(\frac{n h_{k^*}\lambda_0}{\sigma^2_{\max}}\Big)^{r/2}\mathbb{E}|e_j^\top\tilde\theta_{k^*}(x) - e_j^\top\hat\theta(x)|^r\,\mathbb{I}\{\hat k > k^*\}
\le \mathbb{E}\|B_{k^*}^{1/2}(\tilde\theta_{k^*}(x) - \hat\theta(x))\|^r\,\mathbb{I}\{\hat k > k^*\}
\le \mathfrak{z}_{k^*}^{r/2}.$$

To bound the first summand we use the "componentwise" analog of Theorem 2.4.8,
namely Theorem 2.4.12, and this completes the proof. □

2.5 Rates of convergence

2.5.1 Minimax rate of spatially adaptive local polynomial estimators

In this section we give some basic information on spatial adaptation and present the
rate of convergence of the adaptive local polynomial estimator $LP^{ad}(p-1)$ of $f(x)$.
Let us recall that Donoho and Johnstone in [13] suggested how to measure the quality
of adaptive estimators. The authors called this approach "the ideal spatial adaptation"
and defined it as the level of performance which would be achieved by smoothing with
knowledge of the best "oracle" scheme. The estimator corresponding to this scheme
is called an "oracle". The adaptive methods try to construct an estimator which
mimics the performance of the oracle in some sense, for example, in terms of the risk
of estimation. Inequalities relating the risk of the adaptive estimator to the risk of
the oracle are usually referred to as "oracle inequalities". The results obtained in
Section 2.4.4 belong to this family.

To simplify the presentation, in this section we consider a univariate design
in $[0, 1]$. The generalization to the multidimensional case is straightforward. Fix a
point $x \in [0, 1]$ and a method of localization $W_{(\cdot)}$. In this section we also assume
that the basis is polynomial and centered at $x$, that is $\psi_1 \equiv 1$ and $\psi_j(t) = (t - x)^{j-1}/(j-1)!$
with $j = 2, \ldots, p$. As in Section 1.2 we denote for any $k = 1, \ldots, K$
by

$$\tilde f_k(x) = e_1^\top\tilde\theta_k(x) \qquad (5.1)$$

the local polynomial estimator of order $p - 1$ of $f(x)$ corresponding to the $k$th scale
with the bandwidth $h_k = h_k(x)$, or just the $LP_k(p-1)$ estimator of $f(x)$ for short.
Here $e_1 \in \mathbb{R}^p$ is the first canonical basis vector $(1, 0, \ldots, 0)^\top$. As before we assume
that

$$\frac{1}{n} \le h_1 < \ldots < h_k < \ldots < h_K \le 1$$

and therefore that the ordering condition (W) is satisfied. Denote the adaptive local
polynomial estimator $LP^{ad}(p-1)$ of $f(x)$ by

$$\hat f(x) \stackrel{\mathrm{def}}{=} \tilde f_{\hat k}(x) = e_1^\top\hat\theta(x), \qquad (5.2)$$

with $\hat\theta(x)$ defined by (3.1). To obtain bounds for the risk of the adaptive estimator
in [2], [22] and [H] it was suggested to compare the MSE(x) (the Lj,-risk in [22]) 
corresponding to the adaptive estimator f{x) with the infimum over all scales of 
the mean squared risks (the L^, -risks, respectively) of nonadaptive estimators fi{x) , 
$l = 1, \ldots, K$. That is, we compare $\mathbb{E}_f[|\hat f(x) - f(x)|^2]$ with the "best" risk of the
form $\mathbb{E}_f[|\tilde f_l(x) - f(x)|^2]$. Clearly, for any $l$ by the bias-variance decomposition and
by (2.8) we have

$$\mathbb{E}_f[|\tilde f_l(x) - f(x)|^2] = b^2_{l,f}(x) + \sigma_l^2(x),$$

where the variance term is defined by

$$\sigma_l^2(x) \stackrel{\mathrm{def}}{=} \mathbb{E}_f[|e_1^\top\tilde\theta_l(x) - e_1^\top\theta^*_l(x)|^2]$$

and the bias is given by

$$b_{l,f}(x) \stackrel{\mathrm{def}}{=} e_1^\top\theta^*_l(x) - f(x).$$

Here

$$e_1^\top\theta^*_l(x) = \sum_{i=1}^{n} W^*_{l,i}(x) f(X_i)$$

is a local linear smoother of the function $f$ at the point $x$ corresponding to the $l$th
scale, see Section 1.2 for details. The polynomial weights $W^*_{l,i}$ now are defined by

$$W^*_{l,i}(x) = e_1^\top B_l^{-1}\Psi_i W_{l,i}(x) \qquad (5.3)$$

with $B_l$ defined by (2.4). The columns of the "design" matrix $\Psi$ are given by:

$$\Psi_i = \Psi(X_i - x) = (1, X_i - x, \ldots, (X_i - x)^{p-1}/(p-1)!)^\top.$$

MSE^'^(x) = Jff{blj(^) + ^l(^)} (5-4) 

l<k<K '■' 

where for any k the first summand 

hj{x)= sup \bij{x)\= sup \ej0*i{x) - f{x)\ (5.5) 

l<Z<fc l<l<k 

reflects the local smoothness of / within the largest interval [x — hk,x + hk] , contain- 
ing intervals [x — hi,x + hi] with 1 < I < k . Indeed, the smoothness of a function can 
be defined via the quality of its approximation by polynomials, see [T7] for example. 

The bandwidth $h^* = h^*(x, W_{(\cdot)}, f(\cdot))$ providing a trade-off between $\bar b^2_{k,f}(x)$ and the
variance term could be called an "ideal" or "oracle" bandwidth. Unfortunately, as is
generally the case in nonparametric estimation, we cannot minimize the right-hand side of
(5.4) directly because it depends on the unknown function $f$. The lack of information
about $f$ can be compensated by the assumption that $f$ belongs to some smoothness
class, see [26]. This technique in the nonadaptive set-up under the assumption
that $f \in \Sigma(\beta, L)$ on $[0, 1]$ is demonstrated in Section 1.2.2 for the local polynomial
approximation of order $p - 1 = \lfloor\beta\rfloor$. Here the use of (5.5) or of the SMB conditions
allows one to adapt not to the functional class but to the smoothness properties of the
function $f$ itself.

In the pointwise adaptation framework due to Lepski [38], see also [8], it was
discovered that the relation (5.4) "does not work". This means that an adaptive
estimator satisfying (5.4) does not exist. In pointwise estimation one has to pay an
additional logarithmic factor $d(n)$ for proceeding without knowledge of the regularity
properties of $f$. It was proved in and [KJj that this factor $d(n)$ is unavoidable
and is of order $\log n$, where $n$ is the sample size. In [12] for kernel smoothing in the
Gaussian white noise model (in our set-up under regularity assumptions on the design
this is the case of $p = 1$, $\delta = 0$ and $\sigma_i = \sigma$), it was shown that $d(n)$ depends on the
range of adaptation, that is on the ratio of the largest bandwidth to the smallest one,
and that $d(n)$ is not larger in order than $\log n$. This phenomenon can be expressed
as an increase of the noise level leading to the adaptive upper bound for the squared
risk (see [H]) in the following form:

$$\mathrm{MSE}^{ad}(x) = \inf_{1 \le k \le K}\{\bar b^2_{k,f}(x) + \sigma_k^2(x)\,d(n)\}. \qquad (5.6)$$

This relationship (see [12]) can be written in the form of a "balance equation":

$$\bar b_{k,f}(x) = C(w)\,\sigma_k(x)\sqrt{d(n)} \qquad (5.7)$$

with

$$d(n) = \log\Big(\frac{h_K}{h_1}\Big). \qquad (5.8)$$

The optimal selection of the constant $C(w)$ provides sharp oracle results. The bandwidth
$h^* = h_{k^*}$ such that

$$k^* = \max\{k \le K : \bar b_{k,f}(x) \le C(w)\,\sigma_k(x)\sqrt{d(n)}\} \qquad (5.9)$$

is called the "ideal adaptive bandwidth" or just the "oracle bandwidth".

Before proceeding with the analysis of the convergence rate, let us point
out that the weights $W^*_{l,i}(x)$ defined by (5.3) preserve the reproducing polynomials
property:

Proposition 2.5.1. Let $x \in \mathbb{R}$ be such that $B_1 = \sum_{i=1}^{n}\Psi_i\Psi_i^\top W_{1,i}(x) \succ 0$. Then
the weights defined by (5.3) satisfy

$$\sum_{i=1}^{n} W^*_{l,i}(x) = 1, \qquad (5.10)$$

$$\sum_{i=1}^{n}(X_i - x)^m W^*_{l,i}(x) = 0, \qquad m = 1, \ldots, p-1,$$

for all $l = 1, \ldots, K$ and design points $\{X_1, \ldots, X_n\}$.

Proof. By Assumption (W), if $B_1 \succ 0$ at some point $x$, then $B_l = B_l(x) \succ 0$ for
all $l = 1, \ldots, K$, and the assertion follows from the proof of Proposition 1.2.2. □
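The reproducing property of Proposition 2.5.1 can be verified numerically. The sketch below assumes a rectangular kernel and takes the diagonal entries of $W_l$ to be $w_{l,i}/\sigma_i^2$; both choices, as well as the function name, are assumptions made for illustration only.

```python
import numpy as np
from math import factorial

def local_polynomial_weights(X, x, h, p, sigma2=None):
    """Weights W*_i(x) = e_1^T B^{-1} Psi_i W_i for a rectangular kernel of bandwidth h."""
    X = np.asarray(X, dtype=float)
    sigma2 = np.ones_like(X) if sigma2 is None else np.asarray(sigma2, dtype=float)
    # Columns Psi_i = (1, X_i - x, ..., (X_i - x)^{p-1} / (p-1)!)^T
    Psi = np.vstack([(X - x) ** m / factorial(m) for m in range(p)])
    W = (np.abs(X - x) <= h) / sigma2        # assumed form W_i = w_i / sigma_i^2
    PW = Psi * W
    B = PW @ Psi.T                           # B = sum_i Psi_i Psi_i^T W_i
    return np.linalg.solve(B, PW)[0]         # first row of B^{-1} Psi W
```

The property follows from $\sum_i W^*_i \Psi_i^\top = e_1^\top B^{-1}B = e_1^\top$: the first coordinate gives (5.10) and the remaining coordinates give the vanishing moments.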

To simplify the study of (5.6) we need to introduce the following assumptions:

(Lp1') Assume (S1). There exists a number $\lambda_0 > 0$ such that for any $k =
1, \ldots, K$ the smallest eigenvalue fulfills $\lambda_p(B_k) \ge n h_k \lambda_0 \sigma_{\max}^{-2}$ for sufficiently
large $n$.

(Lp2') There exists a real number $a_0 > 0$ such that for any interval $A \subseteq [0, 1]$ and
all $n \ge 1$

$$\frac{1}{n}\sum_{i=1}^{n}\mathbb{I}\{X_i \in A\} \le a_0\max\Big\{\int_A dt,\ \frac{1}{n}\Big\}.$$

(Lp3') The localizing functions (kernels) $W_{k,i}$ are compactly supported in $[0, 1]$
with

$$W_{k,i}(x) = 0 \quad \text{if } |X_i - x| > h_k.$$

This immediately implies the similar property for the local polynomial weights:

$$W^*_{k,i}(x) = 0 \quad \text{if } |X_i - x| > h_k.$$

(Lp4') There exists a finite number $W_{\max}$ such that

$$\sup_{k,i}|W_{k,i}(x)| \le W_{\max}.$$

Remark 2.5.1. Assumption (Lp1') is weaker than (Lp1) because it does not require
the uniformity in $x$.

Remark 2.5.2. Assumption (S1) implies that the condition number

$$\kappa(\Sigma) = \frac{\sigma^2_{\max}}{\sigma^2_{\min}} \qquad (5.11)$$

of the covariance matrix in the known "wrong" model (1.2) is finite.

Theorem 2.5.2. Assume (W), (S), (S1), (Lp1')-(Lp4') and that the smallest
bandwidth $h_1 \ge \frac{1}{n}$. Let $f \in \Sigma(\beta, L)$ on $[0, 1]$ and let $\{\tilde f_k(x)\}_{k=1}^{K}$ be the $LP_k(p-1)$

estimators of $f(x)$ with $p - 1 = \lfloor\beta\rfloor$. Then for sufficiently large $n$ and any $h_k$
satisfying $h_K > \cdots > h_k > \ldots > h_1$, $k = 1, \ldots, K$, the following upper bounds hold:

$$|\bar b_{k,f}(x)| \le C_2\,\kappa(\Sigma)\,\frac{L h_k^{\beta}}{(p-1)!},$$

$$\sigma_k^2(x) \le (1 + \delta)\,\frac{\sigma^2_{\max}}{n h_k \lambda_0}$$

with $C_2 = 2W_{\max}a_0\sqrt{e}/\lambda_0$ and $\delta \in [0, 1)$.

Moreover, the choice of a positive bandwidth $h = h^*(n)$ (see (5.15) for the precise
formula) in the form:

$$h^*(n) = O\Big(\Big(\frac{d(n)}{n}\Big)^{\frac{1}{2\beta+1}}\Big)$$

provides the following upper bound for the risk of the adaptive estimator:

$$\limsup_{n \to \infty}\ \sup_{f \in \Sigma(\beta, L)} \mathbb{E}_f[\psi_n^{-1}|\hat f(x) - f(x)|^2] \le C, \qquad (5.12)$$

where

$$\psi_n = O\Big(\Big(\frac{d(n)}{n}\Big)^{\frac{2\beta}{2\beta+1}}\Big) \qquad (5.13)$$

is given by (5.16) and the constant $C$ is finite and depends on $\beta$, $L$, $\sigma_{\min}$, $\sigma_{\max}$,
$p$, $W_{\max}$ and $a_0$ only.

Remark 2.5.3. The bound for $\sigma_k^2(x)$ is simpler than the corresponding one from
Theorem [L23] due to the assumption of normality of the vector of errors ($\varepsilon \sim \mathcal{N}(0, I_n)$)
in the models (1.1)-(1.2).

Remark 2.5.4. Recall that in [32] it was shown that the "adaptive factor" $d(n)$
cannot be less in order than $\log(h_K h_1^{-1})$.

Proof. The bound for $|b_{l,f}(x)|$ at the point $x$ is obtained as in the proof of
Theorem 1.2.4 by application of the second assertion of Lemma 1.2.3, so we skip some
details. By Proposition 2.5.1 and the Taylor theorem with $\tau_i$ such that the points
$\tau_i X_i$ are between $X_i$ and $x$, and utilizing Assumption (Lp3') we have:

n 



L 



1=1 

n 



X 



1=1 



Under the assumptions of the theorem the sum of the polynomial weights can be
bounded as follows:

n n 



1=1 i=l 

n 



< /t(L) aomax|2,— -| 

Ao nhi 

- ' 

^0 

and the first assertion is justified in view of

$$\bar b_{k,f}(x) \stackrel{\mathrm{def}}{=} \sup_{1 \le l \le k}|b_{l,f}(x)| \le \kappa(\Sigma)\,\frac{2a_0 W_{\max}\sqrt{e}}{\lambda_0}\,\frac{L h_k^{\beta}}{(p-1)!}. \qquad (5.14)$$

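To bound the variance just notice that, because $B_k$ is symmetric and non-degenerate,
by (Lp1') for any $\gamma \in \mathbb{R}^p$ it holds:

$$\gamma^\top B_k^{-1}\gamma \le \frac{\sigma^2_{\max}}{n h_k \lambda_0}\,\|\gamma\|^2.$$

Then under Assumption (S) by (4.10) for the variance term we have:

$$\sigma_k^2(x) = e_1^\top\operatorname{Var}[\tilde\theta_k(x)]\,e_1 \le (1 + \delta)\,\frac{\sigma^2_{\max}}{n h_k \lambda_0}.$$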
By (5.6),

$$\mathrm{MSE}^{ad}(x) \le \inf_{1 \le k \le K}\Big\{\tilde C_2 h_k^{2\beta} + C_1\frac{d(n)}{n h_k}\Big\}$$

with $\tilde C_2 = (C_2 L\,\kappa(\Sigma)/(p-1)!)^2$ and $C_1 = (1 + \delta)\sigma^2_{\max}\lambda_0^{-1}$. The choice of a bandwidth
of the form:

minimizes the upper bound for the $\mathrm{MSE}^{ad}(x)$ and provides the rate $\psi_n$ w.r.t. the
square loss function and over a Hölder class $\Sigma(\beta, L)$:

^n = C^j^J ((1 + 5VL.^J • (5.16) 

Here $C$ and $\tilde C$ depend only on $W_{\max}$, $a_0$, $\lambda_0$ and $\beta$. □
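To illustrate the bias-variance balance used in this proof, consider an upper bound of the form $g(h) = c_{\mathrm{bias}}h^{2\beta} + c_{\mathrm{var}}\,d(n)/(nh)$. Setting $g'(h) = 0$ gives $h^{2\beta+1} = c_{\mathrm{var}}d(n)/(2\beta c_{\mathrm{bias}}n)$, so that $g$ at the minimizer is proportional to $(d(n)/n)^{2\beta/(2\beta+1)}$. The sketch below uses generic constants `c_bias` and `c_var`, which are placeholders and not the constants of the theorem.

```python
def risk_bound(h, beta, c_bias, c_var, d_n, n):
    """Upper bound g(h) = c_bias * h^(2 beta) + c_var * d_n / (n h)."""
    return c_bias * h ** (2.0 * beta) + c_var * d_n / (n * h)

def balanced_bandwidth(beta, c_bias, c_var, d_n, n):
    """Minimizer of g over h > 0, obtained from g'(h) = 0."""
    return (c_var * d_n / (2.0 * beta * c_bias * n)) ** (1.0 / (2.0 * beta + 1.0))
```

Multiplying the sample size by a factor $a$ with $d(n)$ held fixed multiplies the minimal value of $g$ by $a^{-2\beta/(2\beta+1)}$, which is the rate appearing in (5.13).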



2.5.2 SMB, the bias-variance trade-off and the rate of convergence

The choice of the "ideal adaptive bandwidth" usually can be done by (5.9). In [66] it
was shown that the small modeling bias (SMB1) condition (4.46) can be obtained
from the "bias-variance trade-off" relations. Unfortunately, to have the "modeling
bias" $\Delta(k) = O(1)$ (this is $\Delta_1(k)$ in the present framework) one should apply the
balance equation (5.7) or (5.9) without the "adaptive factor" $d(n)$, see equation (3.5)
in [66]. In the Gaussian regression set-up (example 1.1 in [66]) under smoothness
assumptions on the regression function $f \in \Sigma(\beta, L)$ this results in a suboptimal rate
in the upper bound for the $\mathrm{MSE}(x)$:

$$O\Big(L^{\frac{2}{2\beta+1}}\Big(\frac{(\log n)^{\gamma}}{n}\Big)^{\frac{2\beta}{2\beta+1}}\Big) \qquad (5.17)$$

with $\gamma = \frac{2\beta+1}{2\beta} > 1$. Notice that, due to the normalization by $\sqrt{\operatorname{Var}[\tilde\theta_l]}$, the adaptive
procedure used in [66] coincides with Lepski's selection rule from [3H] and [12].
Because local constant Gaussian regression under a regularity assumption on the design
is equivalent to the Gaussian white noise model, it is known from these papers that
this procedure is rate optimal with the minimax rate $\psi_n = C L^{\frac{2}{2\beta+1}}\big(\frac{\log n}{n}\big)^{\frac{2\beta}{2\beta+1}}$,
that is, $\gamma$ should be equal to 1. This shows that the method of obtaining the upper
bounds from [66] and generalized in the present work should be refined. This lack of
optimality was also independently noticed in [58].

Now we will demonstrate that: (1) the definition of the "ideal adaptive bandwidth"
(5.9) with $d(n) = 1$ implies the (SMBj) conditions; (2) for Lepski's selection
rule in our framework we have the same rate for the upper bound of the risk as in
equation (5.17).

Notice that for the method of local approximation using the polynomial basis
centered at $x$ the definition of the "ideal adaptive bandwidth" (5.9) can be easily
generalized for the estimators of the derivatives of $f$ defined by (4.33). Then, given
a point $x$ and the method of localization $W_{(\cdot)}$, for any $j = 1, \ldots, p$ the formula (5.9)
reads as follows:

$$k^*(j) \stackrel{\mathrm{def}}{=} \max\{k \le K : \bar b_{k,f^{(j-1)}}(x) \le C_j(w)\,\sigma_k(x)\sqrt{d(n)}\}, \qquad (5.18)$$

where $C_j(w)$ is a constant depending on the choice of the smoother $W_{(\cdot)}$,

$$\bar b_{k,f^{(j-1)}}(x) = \sup_{1 \le l \le k}|e_j^\top\theta^*_l(x) - f^{(j-1)}(x)|,$$

$$\sigma_k^2(x) = \operatorname{Var}_{f,\Sigma_0}[e_j^\top\tilde\theta_k(x)],$$

and $f^{(0)}$ stands for the function $f$ itself. To bound the "modeling bias" $\Delta_j(k)$ we
need the following assumption:

(S_{kj}) There exists a constant $s_j > 0$ such that for all $k \le K$

$$\Sigma_{k,j}^{-1} \preceq s_j\,\Sigma_{k,j,\mathrm{diag}}^{-1}, \qquad (5.19)$$

where $\Sigma_{k,j,\mathrm{diag}} = \operatorname{diag}(\operatorname{Var}_{\theta,\Sigma}[e_j^\top\tilde\theta_1(x)], \ldots, \operatorname{Var}_{\theta,\Sigma}[e_j^\top\tilde\theta_k(x)])$ is a diagonal matrix
composed of the diagonal elements of $\Sigma_{k,j}$. Thus we have the following result:

Theorem 2.5.3. Assume (B), (S) and (S_{kj}). Let the weights $\{w_{k,i}(x)\}$ satisfy
(6.3). Then for any given point $x$, smoothing function $W_{(\cdot)}$ and $j = 1, \ldots, p$
the choice of $k(j) = k^*(j)$ defined by the relation (5.18) with $d(n) = 1$ implies the
(SMBj) condition $\Delta_j(k(j)) \le \Delta_j$ with the constant $\Delta_j = s_j C_j^2(w)(1+\delta)(1 - u_0^{-1})^{-1}$.

Proof. Consider the quantity $b_j(k)^\top\Sigma_{k,j,\mathrm{diag}}^{-1}b_j(k)$. Suppose that $e_j^\top\theta(x) = f^{(j-1)}(x)$.
In view of relation (6.3) for the weights $\{w_{l,i}(x)\}$ the form of the matrix $\Sigma_{k,j,\mathrm{diag}}$ is
particularly simple:

$$\Sigma_{k,j,\mathrm{diag}} = \operatorname{diag}(e_j^\top B_1^{-1}e_j, \ldots, e_j^\top B_k^{-1}e_j).$$

Then by (B) and (5.18)



bAkV^kid^agbAk) = ^ 



1 ^jBr^e, 



2 ^ 1 

< {bkju-^)ix)YY.^f^ 



{bkju-^){,x)) -(k-l) 



- al{x){l - u-,') ■ 
By (5.18) with $d(n) = 1$ the choice of $k = k^*(j)$ implies $(\bar b_{k,f^{(j-1)}}(x))^2 \le C_j^2(w)\,\sigma_k^2(x)$.
Thus

$$b_j(k)^\top\Sigma_{k,j,\mathrm{diag}}^{-1}b_j(k) \le (1 + \delta)\,C_j^2(w)(1 - u_0^{-1})^{-1}$$

and

$$\Delta_j(k) = b_j(k)^\top\Sigma_{k,j}^{-1}b_j(k) \le s_j C_j^2(w)(1 + \delta)(1 - u_0^{-1})^{-1}.$$

□

Now we will show that the "oracle" risk bound from Corollary 2.4.16 delivers
at least the suboptimal rate of convergence (5.17) for the upper bound of the risk
w.r.t. the polynomial loss function and over a Hölder class $\Sigma(\beta, L)$. For simplicity
we restrict ourselves to the case of the univariate design. We study the quality of
the $LP^{ad}(p-1)$ estimator $\hat f(x)$ of $f(x)$ under the assumption that $f \in \Sigma(\beta, L)$
on $[0, 1]$ with $\lfloor\beta\rfloor = p - 1$.

Denote by $k^*$ the index $k^*(j)$ with $j = 1$ from Theorem 2.5.3 and the corresponding
bandwidth $h_{k^*}$ by $h^*$. Then the following asymptotic result holds:

Theorem 2.5.4. Assume (B), (D), (Loc), (PC), (S), (S1), (S_{k1}), (Lp1')-(Lp4'),
(W), and that the smallest bandwidth fulfills $h_1 \ge \frac{1}{n}$ and is such that the
first estimator $\tilde f_1(x) = e_1^\top\tilde\theta_1(x)$ is always accepted by the adaptive procedure. Let the
weights $\{w_{k,i}(x)\}$ satisfy (6.3). Assume that for $x \in (0, 1)$ there exists $\theta(x) \in \mathbb{R}^p$
such that $f(x) = e_1^\top\theta(x)$. Let $f \in \Sigma(\beta, L)$ on $[0, 1]$ with $p - 1 = \lfloor\beta\rfloor$. Then for the
risk of the adaptive $LP^{ad}(p-1)$ estimator $\hat f(x)$ of the function $f(x)$ at the point
$x \in (0, 1)$ the following upper bound holds:

$$\mathbb{E}|\hat f(x) - f(x)|^r \le C L^{\frac{r}{2\beta+1}}\Big(\frac{(\log n)^{\gamma}}{n}\Big)^{\frac{r\beta}{2\beta+1}}(1 + o(1)), \qquad n \to \infty,$$

with $\gamma = \frac{2\beta+1}{2\beta}$ and the constant $C$ depending on $\beta$, $\sigma_{\min}$, $\sigma_{\max}$, $p$, $W_{\max}$, $\lambda_0$ and
$a_0$ only.

Proof. By the triangle inequality and the inequality $(a + b)^r \le C_r(a^r + b^r)$ with
$C_r = 2^{r-1}$ for $r \ge 1$ and $C_r = 1$ for $r \in (0, 1)$, for any $k = 1, \ldots, K$ we have

$$|\hat f(x) - f(x)|^r \le C_r\big[|\hat f(x) - \tilde f_k(x)|^r + |\tilde f_k(x) - f(x)|^r\big].$$

Let $\theta(x) \in \mathbb{R}^p$ be such that $f(x) = e_1^\top\theta(x)$. Then, because $\alpha \in (0, 1]$, by
Theorem [532] and Corollaries 2.4.13 and 2.4.16 we have



By Theorem 2.4.7 $\mathfrak{z}_{k^*}$ is not larger in order than $K \asymp \log n$. Then for $\delta = o(1/\log n)$

E|/(x)-/(x)r<c(^i^y (1 + 0(1)), 00. 

The precise constant can be extracted easily, but because we will in any case get only a
suboptimal upper bound, in the following we will not keep track of the constants. The
balance equation (5.18) with $j = 1$ and $d(n) = O(1)$ and the bounds for the bias
and variance from Theorem 2.5.2 suggest the choice of bandwidths in the form:

leading to the following bound for the risk:

r/3 



n 



□ 



2.6 Auxiliary results 

Lemma 2.6.1. (Pivotality property)

Let (W) hold. Under $H_\varkappa$ for any $k \le \varkappa$, the risk associated with the adaptive
estimator at every step of the procedure does not depend on the parameter $\theta$:

$$\mathbb{E}_\theta|(\tilde\theta_k - \hat\theta_k)^\top B_k(\tilde\theta_k - \hat\theta_k)|^r = \mathbb{E}_0|(\tilde\theta_k - \hat\theta_k)^\top B_k(\tilde\theta_k - \hat\theta_k)|^r,$$

where $\mathbb{E}_0$ denotes the expectation w.r.t. the centered measure $\mathcal{N}(0, \Sigma)$ or $\mathcal{N}(0, \Sigma_0)$.

Proof. After the first $k$ steps $\hat\theta_k$ coincides with one of $\tilde\theta_m$, $m \le k$, and this event
takes place if for some $l \le m$ the statistic $T_{l,m+1} > \mathfrak{z}_l$. Because the hypothesis $H_k$
implies $H_{m+1}$ for all $m < k$ and in view of the decomposition (2.7) it holds

$$\{T_{l,m+1} > \mathfrak{z}_l \text{ for some } l = 1, \ldots, m \mid H_{m+1}\}$$
$$= \{(\tilde\theta_l - \tilde\theta_{m+1})^\top B_l(\tilde\theta_l - \tilde\theta_{m+1}) > \mathfrak{z}_l \text{ for some } l = 1, \ldots, m \mid H_{m+1}\}$$



> 3/ , I <m 



with $\varepsilon \sim \mathcal{N}(0, I_n)$. The probability of this event does not depend on the shift $\theta$,
so without loss of generality $\theta$ can be taken equal to zero. The risk associated with
the estimator $\hat\theta_k$ admits the following decomposition:

k-l 

E0\iek-ekVBk{0k-0k)\'' = Y,^o\(ek-emVBk{Ok-e^Wi{dk = d^}. 

m=l 

Under Hk for all m < k the joint distribution of {Ok — 6„iYBk{0k — Om) does not 
depend on by the same argumentation. □ 

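Lemma 2.6.2. The matrices $J_k \otimes \Sigma$ and $J_k \otimes \Sigma_0$ are positive semidefinite for any
$k = 2, \dots, K$.

Moreover, under the condition $(\mathfrak{S})$ with the same $\delta$ the following relation similar
to $(\mathfrak{S})$ holds for the covariance matrices $\Sigma_k$ and $\Sigma_{k,0}$ of the linear estimators:

\[ (1 - \delta)\,\Sigma_k \preceq \Sigma_{k,0} \preceq (1 + \delta)\,\Sigma_k, \qquad k \le K. \]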

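Proof. Symmetry of $J_k$ and $\Sigma$ (respectively, $\Sigma_0$) implies symmetry of $J_k \otimes \Sigma$
(respectively, $J_k \otimes \Sigma_0$). Notice that any vector $\gamma_{nk} \in \mathbb{R}^{nk}$ can be represented as a
partitioned vector $\gamma_{nk}^\top = ((\gamma_n^{(1)})^\top, (\gamma_n^{(2)})^\top, \dots, (\gamma_n^{(k)})^\top)$, with $\gamma_n^{(l)} \in \mathbb{R}^n$, $l = 1, \dots, k$.
Then

\[ \gamma_{nk}^\top (J_k \otimes \Sigma)\, \gamma_{nk} = \Big( \sum_{l=1}^{k} \gamma_n^{(l)} \Big)^\top \Sigma\, \Big( \sum_{l=1}^{k} \gamma_n^{(l)} \Big) = \gamma_n^\top \Sigma\, \gamma_n, \tag{6.1} \]

where $\gamma_n \stackrel{\mathrm{def}}{=} \sum_{l=1}^{k} \gamma_n^{(l)}$. Because $\Sigma \succ 0$, this implies $\gamma_n^\top \Sigma\, \gamma_n > 0$ for all
$\gamma_n \ne 0$. But even for $\gamma_{nk} \ne 0$, if its subvectors $\{\gamma_n^{(l)}\}$ are linearly dependent, $\gamma_n$
can be zero. Thus there exists a nonzero vector $\gamma$ such that $\gamma^\top (J_k \otimes \Sigma)\gamma = 0$. This
means positive semidefiniteness.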
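
The identity (6.1) can be checked numerically on a small example. The following is a minimal sketch, assuming (as the computation in (6.1) suggests) that $J_k$ is the $k \times k$ matrix of ones; the matrix `sigma` and the block vectors are illustrative values that do not come from the text.

```python
def direct_quadratic_form(blocks, sigma):
    """gamma^T (J_k x Sigma) gamma computed entry by entry.

    Assumption: J_k is the k x k all-ones matrix, so the ((l, i), (m, j))
    entry of the Kronecker product is simply sigma[i][j].
    """
    k, n = len(blocks), len(sigma)
    return sum(blocks[l][i] * sigma[i][j] * blocks[m][j]
               for l in range(k) for m in range(k)
               for i in range(n) for j in range(n))


def reduced_quadratic_form(blocks, sigma):
    """Right-hand side of (6.1): quadratic form of the sum of the blocks."""
    n = len(sigma)
    total = [sum(block[i] for block in blocks) for i in range(n)]
    return sum(total[i] * sigma[i][j] * total[j]
               for i in range(n) for j in range(n))


sigma = [[2.0, 0.5], [0.5, 1.0]]          # illustrative positive definite matrix
generic = [[1.0, -2.0], [0.5, 3.0]]       # blocks with a nonzero sum
cancelling = [[1.0, -2.0], [-1.0, 2.0]]   # nonzero blocks whose sum is zero

value_generic = direct_quadratic_form(generic, sigma)
value_cancelling = direct_quadratic_form(cancelling, sigma)
```

The second vector is nonzero but lies in the kernel of $J_k \otimes \Sigma$, which is why only semidefiniteness can be claimed.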

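The second assertion follows from the observation that the condition $(\mathfrak{S})$, due to
the equality (6.1), also holds for the Kronecker product:

\[ (1 - \delta)\, J_k \otimes \Sigma \preceq J_k \otimes \Sigma_0 \preceq (1 + \delta)\, J_k \otimes \Sigma. \tag{6.2} \]

Therefore

\[ (1 - \delta)\, D_k (J_k \otimes \Sigma) D_k^\top \preceq D_k (J_k \otimes \Sigma_0) D_k^\top \preceq (1 + \delta)\, D_k (J_k \otimes \Sigma) D_k^\top. \]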

□ 

Lemma 2.6.3. Suppose that the weights $\{w_{l,i}(x)\}$ for every fixed $x \in \mathbb{R}^d$ satisfy

\[ w_{l,i}(x)\, w_{m,i}(x) = w_{l,i}(x), \qquad l \le m. \tag{6.3} \]

Then under the conditions $(\mathfrak{D})$, $(\mathfrak{L}oc)$, $(\mathfrak{B})$ the covariance matrix $\Sigma_k$ defined by
(4.17) is nonsingular with

\[ \det \Sigma_k = \det B_k^{-1} \prod_{l=2}^{k} \det(B_{l-1}^{-1} - B_l^{-1}) > 0, \qquad k = 2, \dots, K. \tag{6.4} \]

Remark 2.6.1. The condition (6.3) holds for rectangular kernels with nested supports.
Proof. The condition (6.3) implies $W_l \Sigma W_m = \mathrm{diag}(w_{l,1} w_{m,1} \sigma_1^2, \dots, w_{l,n} w_{m,n} \sigma_n^2)$



T-D-l 
m 



Wi for any I < m . Thus, the blocks of simplify to ASD^ = B,"^*W/SW„*^B 
B;~^^'Wi^'^B,^j^ , and has a simple structure: 

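\[
\Sigma_k = \begin{pmatrix}
B_1^{-1} & B_2^{-1} & B_3^{-1} & \dots & B_k^{-1} \\
B_2^{-1} & B_2^{-1} & B_3^{-1} & \dots & B_k^{-1} \\
B_3^{-1} & B_3^{-1} & B_3^{-1} & \dots & B_k^{-1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
B_k^{-1} & B_k^{-1} & B_k^{-1} & \dots & B_k^{-1}
\end{pmatrix}.
\]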
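Then the determinant of $\Sigma_k$ coincides with the determinant of the following irreducible
block triangular matrix:

\[
\det \Sigma_k = \det \begin{pmatrix}
B_1^{-1} - B_2^{-1} & B_2^{-1} - B_3^{-1} & \dots & B_{k-1}^{-1} - B_k^{-1} & B_k^{-1} \\
0 & B_2^{-1} - B_3^{-1} & \dots & B_{k-1}^{-1} - B_k^{-1} & B_k^{-1} \\
\vdots & & \ddots & \vdots & \vdots \\
0 & 0 & \dots & B_{k-1}^{-1} - B_k^{-1} & B_k^{-1} \\
0 & 0 & \dots & 0 & B_k^{-1}
\end{pmatrix},
\]

implying

\[ \det \Sigma_k = \det(B_1^{-1} - B_2^{-1}) \det(B_2^{-1} - B_3^{-1}) \cdots \det(B_{k-1}^{-1} - B_k^{-1}) \det B_k^{-1}. \]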

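Clearly the matrix $\Sigma_k$ is nonsingular if all the matrices $B_{l-1}^{-1} - B_l^{-1}$ are nonsingular.
By $(\mathfrak{D})$ and $(\mathfrak{L}oc)$, $B_l \succ 0$ for any $l$. By $(\mathfrak{B})$ there exists $u_0 > 1$ such that
$B_l \succeq u_0 B_{l-1}$; therefore $B_{l-1}^{-1} - B_l^{-1} \succeq (1 - 1/u_0)\, B_{l-1}^{-1} \succ 0$. □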
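
The determinant formula (6.4) can be verified exactly in the scalar case $p = 1$, where each block $B_l^{-1}$ is a number. The sketch below is only an illustration under that simplifying assumption: it builds the matrix whose $(l, m)$ entry is $B_{\max(l,m)}^{-1}$ and compares its determinant with the product in (6.4). The values in `b` and the helper `determinant` are hypothetical and chosen for the example.

```python
from fractions import Fraction


def determinant(matrix):
    """Exact determinant by Gaussian elimination over Fractions."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


# Illustrative increasing scalars B_1 < B_2 < ... < B_k (scalar case p = 1).
b = [Fraction(1), Fraction(3), Fraction(7), Fraction(20)]
k = len(b)
inv = [1 / value for value in b]

# Block (l, m) of Sigma_k equals B_{max(l, m)}^{-1}.
sigma_k = [[inv[max(l, m)] for m in range(k)] for l in range(k)]

lhs = determinant(sigma_k)
rhs = inv[-1]
for l in range(1, k):
    rhs *= inv[l - 1] - inv[l]
```

Both sides agree exactly because the arithmetic is carried out in rational numbers; positivity follows from the strict ordering of the $B_l$.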

Lemma 2.6.4. Under the alternative the moment generating function (mgf) of the
joint distribution of $\tilde\theta_1, \dots, \tilde\theta_K$ is



Eexp {7'^(vec0A' - vec©^)} = exp <j i7'^S/^,o 7 



(6.5) 
Thus, provided that S^-.o >~ ^ , it holds vec @k ~ A/" (vec 0^, '^k,o) ■

Similarly, under the null, if >- , the joint distribution of vec @k is M (vec @Ki '^k) 
with mgf 

Eexp |7^(vec ©/^ — vec ©i^-)} = exp |-7^Sa' T^- (6.6) 

Proof. Let 7 G M^^ be written in a partitioned form 7""^ = (7^,...,7j) with 
subvectors 7; G iR^ , / = 1, . . . ,K . Then the mgf for the centered random vector 
vec0i^ - vecej^ G R^^ due to the decomposition di = 0*i + DiT^J'^e with 

can be represented as follows: 



E exp {7T (vec Ga' - vec 0^^) } = E exp { ^ 7^^ (0; - fi'H } 

(=1 

if if 
Eexp {Y,ljDi'L]^^e] = Eexp { ( ^,^70^^;/'^}. 



A trivial observation that ^f=iDj 1i is a vector in iR" and Sg^^s ~ A/'(0,So) by 
(11. ip implies by definition of Sif,o the first assertion of the lemma, because 

Eexp { ( f: Dj^^flll'^e] = exp |i ( f: Dj^^fll, ( J] A^O | 
«=i /=i 1=1 ^ 

= exp|^(Dj7)^(J^®Eo)Dj7| =exp|^7TSK,o7|, 

where $D_K$ is defined by (4.39). □

Lemma 2.6.5. The Kullback-Leibler divergence between the distributions of vec 0^ 
under the alternative and under the null has the following form: 



2KL(iP^^„, ry 2E,,s„ log (— ^) 
detSi 



A(A;) + log ( ) + tr(S, iSfc,o) - pk, (6.7) 
where

b{k) = vec0^-vec0fe, (6.8) 
A(A;) - b{kyj:^^b{k). (6.9) 

Proof. Denote the Radon-Nikodym derivative by dlP^ j.^/dlPg j. . Then 

+ i||St"'(!;-vec0t)||' (6.10) 

can be considered as a quadratic function of vec . By the Taylor expansion at the 
point vec Ql the last expression reads as follows: 

log(Z.fo)) 4log(^) - l||E-%-vece;)IP 

Then the expression for the Kullback-Leibler divergence can be written in the following
way:

1HL(FJ,j:„, Pl^) S m,,^ log (Zt) 

where ^ ~ (0, Ipk) ■ This implies 

2KL(iP^^,^„, Pl^) = A(A;) + log (^^) + tr(S-iS,,o) - pA;. (6.11) 
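
In dimension one the closed form (6.11) reduces to the familiar expression for two univariate Gaussians, and it can be compared with a direct numerical integration. The sketch below is an illustration only: the function names, the parameter values and the midpoint rule are choices made for the example, and the roles of the two measures (expectation taken under the first one) are assumed to match the display above.

```python
import math


def kl_closed_form(m0, s0, m1, s1):
    """KL(N(m0, s0^2) || N(m1, s1^2)) written in the shape of (6.11):
    half of  Delta + log(det ratio) + trace - dimension,  with dimension 1."""
    delta = (m0 - m1) ** 2 / s1 ** 2
    return 0.5 * (delta + math.log(s1 ** 2 / s0 ** 2) + s0 ** 2 / s1 ** 2 - 1.0)


def kl_numeric(m0, s0, m1, s1, half_width=12.0, steps=20_000):
    """Midpoint-rule approximation of the integral of p0 * log(p0 / p1)."""
    def log_pdf(x, m, s):
        return -0.5 * ((x - m) / s) ** 2 - math.log(s * math.sqrt(2.0 * math.pi))

    lower = m0 - half_width * s0
    step = 2.0 * half_width * s0 / steps
    total = 0.0
    for i in range(steps):
        x = lower + (i + 0.5) * step
        lp0 = log_pdf(x, m0, s0)
        total += math.exp(lp0) * (lp0 - log_pdf(x, m1, s1)) * step
    return total


closed = kl_closed_form(0.3, 1.1, 0.0, 1.0)
numeric = kl_numeric(0.3, 1.1, 0.0, 1.0)
```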

In the case of homogeneous errors with $\sigma_{0,i} \equiv \sigma_0$ and $\sigma_i \equiv \sigma$, $i = 1, \dots, n$, the
calculations simplify considerably. Now
with a $pk \times pk$ matrix defined as

where A = (*W;*^)"^*Wi , / = 1, . . . , /c , does not depend on a . Then A(fc) = 
(T-2Ai(A;) , with Ai(A;) = h{k)'^Y^%k) , det S^/ det Sfc,o = (^V^o)^'' and the ex- 
pression for the Kullback-Leibler divergence reads as follows: 

KUFl^^,P,\^) = ^Hog(^) + iA(A;) + ^(4-l) 

(To ^ Z (7 

= P^log(^) + ^KfcrV,^W + ^(^-l), (6.12) 
implying the same asymptotic behavior as in (4.25).

□ 
Chapter 3

Dependence on the dimension for 
complexity of approximation 
of random fields 

In this chapter we consider the $\varepsilon$-approximation by $n$-term partial sums of the
Karhunen-Loève expansion of $d$-parametric random fields of tensor product-type in
the average case setting. We investigate the behavior as $d \to \infty$ of the information
complexity $n(\varepsilon, d)$ of approximation with error not exceeding a given level $\varepsilon$. It was
recently shown by Lifshits and Tulyakova [44] that for this problem one observes the
curse of dimensionality (intractability) phenomenon. We present the exact asymptotic
expression for the information complexity $n(\varepsilon, d)$.
3.1 Introduction and set-up

Suppose we have a random function $X(t)$, with $t$ in a compact parameter set $T$,
admitting a series representation via random variables $\xi_k$ and the deterministic real
functions $\varphi_k$, namely,

\[ X(t) = \sum_{k=1}^{\infty} \xi_k \varphi_k(t), \]

where the series converges in the mean and a.s. for each $t \in T$. A more precise
description will be given later. For any finite set of positive integers $\mathcal{K} \subset \mathbb{N}$ let
$X_{\mathcal{K}}(t) = \sum_{k \in \mathcal{K}} \xi_k \varphi_k(t)$. In many problems one needs to approximate $X$, for instance
under the $L_2$-norm, with a finite-rank process $X_{\mathcal{K}}$. Natural questions arise: How large
should $\mathcal{K}$ be in order to yield a given small approximation error? Given the size of
$\mathcal{K}$, which $\mathcal{K}$ provides the smallest error?

In this chapter we address the first of these questions for a specific class of random
functions, namely tensor product-type random fields with high-dimensional parameter
sets. The tensor product-type field is a separable zero-mean random function
$X = \{X(t)\}_{t \in T}$, with a rectangular parameter set $T \subset \mathbb{R}^d$ and covariance function $\mathcal{K}^{(d)}$
which can be decomposed into a product of equal "marginal" covariances depending
on different arguments. Namely, let $T = [0, 1]^d$ and

\[ \mathcal{K}^{(d)}(s, t) = \prod_{l=1}^{d} \mathcal{K}_l(s_l, t_l) \tag{1.1} \]

for all $s_l, t_l \in [0, 1]$, $s = (s_1, \dots, s_d)$, $t = (t_1, \dots, t_d)$. Obviously, the integral operator
with the kernel (1.1) is the tensor product of the integral operators with the kernels
$\mathcal{K}_l(s_l, t_l)$.
Let $\{\lambda_i\}_{i \ge 1}$ be a nonnegative sequence satisfying

\[ \sum_{i=1}^{\infty} \lambda_i^2 < \infty \tag{1.2} \]

and let $\{\varphi_i\}_{i > 0}$ be an orthonormal basis in $L_2[0, 1]$. Consider a family of tensor
product-type random fields

\[ \mathbb{X} = \{X^{(d)}(t),\ t \in [0, 1]^d\}, \qquad d = 1, 2, \dots. \tag{1.3} \]

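According to the multiparametric Karhunen-Loève expansion (see p] for details),
the family (1.3) can be given by

\[ X^{(d)}(t) = \sum_{k \in \mathbb{N}^d} \xi_k \prod_{l=1}^{d} \lambda_{k_l} \prod_{l=1}^{d} \varphi_{k_l}(t_l) = \sum_{k_1=1}^{\infty} \cdots \sum_{k_d=1}^{\infty} \xi_{k_1, \dots, k_d} \prod_{l=1}^{d} \lambda_{k_l} \varphi_{k_l}(t_l), \tag{1.4} \]

where the series converges a.s. for every $t = (t_1, \dots, t_d) \in [0, 1]^d$. The collection
$\{\xi_k\}$ is an array of noncorrelated random variables with zero mean and unit variance,
and $\lambda_{k_l}^2$ and $\varphi_{k_l}$ are, respectively, the eigenvalues and eigenfunctions of the family of
integral equations

\[ \lambda_{k_l}^2\, \varphi_{k_l}(t_l) = \int_0^1 \mathcal{K}_l(s_l, t_l)\, \varphi_{k_l}(s_l)\, ds_l, \qquad t_l \in [0, 1], \quad l = 1, \dots, d, \]

corresponding to the "marginal" covariance operators. Clearly, under assumption (1.2)
the sample paths of $X^{(d)}$ belong to $L_2([0, 1]^d)$ almost surely and the covariance operator
of $X^{(d)}$ has the system of eigenvalues

\[ \lambda_k^2 = \prod_{l=1}^{d} \lambda_{k_l}^2, \qquad k \in \mathbb{N}^d. \tag{1.5} \]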

As was mentioned in [60], the Karhunen-Loève expansion or the proper orthogonal
decomposition of random functions was introduced independently and almost
simultaneously by Kosambi [35], Loève [36], Karhunen [27], [28], Obukhov [50], and
Pougachev [5^ . 

In what follows we suppress the index $d$ and write $X(t)$ instead of $X^{(d)}(t)$. For
any $n > 0$, let $X_n$ be the partial sum of (1.4) corresponding to the $n$ maximal eigenvalues.
We study the average case error of approximation to $X$ by $X_n$,

\[ e(X, X_n; d) = \left( \mathbb{E}\|X - X_n\|_2^2 \right)^{1/2}, \]

as $d \to \infty$.

It is well known (see, for example, [9], [36] or [59]) that $X_n$ provides the minimal
average quadratic error among all linear approximations to $X$ having rank $n$. Because
we are going to explore a family of random functions, it is more natural to investigate
relative errors, that is, to compare the error size with the size of the function itself.
Denote the "marginal" trace by

\[ \Lambda \stackrel{\mathrm{def}}{=} \sum_{i=1}^{\infty} \lambda_i^2. \]

Then

\[ \mathbb{E}\|X\|_2^2 = \sum_{k \in \mathbb{N}^d} \lambda_k^2 = \Lambda^d. \]
The average case information complexity for the normalized error criterion is
the minimal number of terms in $X_n$ (equivalently, of the largest eigenvalues, taken in
decreasing order) needed to approximate $X$ with the error not exceeding a given
level $\varepsilon$:

\[ n(\varepsilon, d) = \min\left\{ n : \frac{e(X, X_n; d)}{(\mathbb{E}\|X\|_2^2)^{1/2}} \le \varepsilon \right\} = \min\left\{ n : \mathbb{E}\|X - X_n\|_2^2 \le \varepsilon^2 \Lambda^d \right\}. \]
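
For small $d$ and a truncated marginal spectrum, the definition of $n(\varepsilon, d)$ can be evaluated by brute force. The following is a minimal sketch under stated assumptions: the marginal sequence is replaced by a finite list `marginal_sq` of values $\lambda_i^2$ (a hypothetical truncation), so $\Lambda$ is the sum of that list, and all $d$-fold products are enumerated explicitly. This is feasible only for very small problems.

```python
import itertools
import math


def information_complexity(marginal_sq, d, eps):
    """Smallest n such that keeping the n largest product eigenvalues leaves
    a squared error of at most eps^2 * Lambda^d.

    marginal_sq: finite list of lambda_i^2 (a truncation, assumed for the sketch).
    """
    products = sorted(
        (math.prod(combo) for combo in itertools.product(marginal_sq, repeat=d)),
        reverse=True,
    )
    total = sum(products)              # equals Lambda^d for the truncated list
    budget = eps ** 2 * total
    remaining = total                  # squared error when nothing is kept
    for n, value in enumerate(products):
        if remaining <= budget:
            return n
        remaining -= value
    return len(products)


n_equal = information_complexity([1.0, 1.0], 2, 0.5)
n_d1 = information_complexity([1.0, 0.25], 1, 0.5)
n_d2 = information_complexity([1.0, 0.25], 2, 0.5)
```

The number of products grows like the list length to the power $d$, which is itself a reminder of why asymptotic expressions are needed for large $d$.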

The study of $n(\varepsilon, d)$ we are interested in here belongs to the class of problems
dealing with the dependence of the information complexity for linear multivariate
problems on the dimension; see the papers of Woźniakowski [76], [77], [78], [79] and
the references therein.

Generally, the linear tensor problems with $\lambda_2 > 0$ for the normalized error criterion
are intractable, since

\[ n(\varepsilon, d) \ge (1 - \varepsilon^2) \left( 1 + \frac{\lambda_2^2}{\lambda_1^2} \right)^d \qquad \text{for all } \varepsilon \in [0, 1) \]

is exponential in $d$ and the curse of dimensionality takes place; see Theorem 6.6
of [l9]. However, it is interesting to know the exact behavior of the information
complexity $n(\varepsilon, d)$ even in this case, because this kind of negative result can help in
lifting the curse of dimensionality.
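
To see how quickly such a lower bound grows, one can tabulate it for a few dimensions. The sketch below assumes the bound has the form $(1 - \varepsilon^2)(1 + \lambda_2^2/\lambda_1^2)^d$, which follows from $\Lambda / \lambda_1^2 \ge 1 + \lambda_2^2 / \lambda_1^2$; the function name and the numerical values are illustrative only.

```python
def lower_bound(lam1_sq, lam2_sq, d, eps):
    """(1 - eps^2) * (1 + lambda_2^2 / lambda_1^2)^d, with lam*_sq = lambda_*^2."""
    return (1.0 - eps ** 2) * (1.0 + lam2_sq / lam1_sq) ** d


# Illustrative values: lambda_1^2 = 1, lambda_2^2 = 1/4, eps = 1/2.
table = {d: lower_bound(1.0, 0.25, d, 0.5) for d in (2, 10, 20, 40)}
```

Each additional ten dimensions multiplies the bound by the same factor $(1 + \lambda_2^2/\lambda_1^2)^{10}$, which is the exponential growth referred to above.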

It was suggested in [44] to use an auxiliary probabilistic construction for studying
the properties of the deterministic array of eigenvalues (1.5). We follow this approach.

Consider a sequence of independent identically distributed random variables $\{U_l\}$,
$l = 1, 2, \dots$, with the common distribution given by

\[ \mathbb{P}(U_l = -\log \lambda_i) = \frac{\lambda_i^2}{\Lambda}, \qquad i = 1, 2, \dots. \tag{1.6} \]

Under the assumption

\[ \sum_{i=1}^{\infty} |\log \lambda_i|^3 \lambda_i^2 < \infty, \tag{1.7} \]

the condition $\mathbb{E}|U_l|^3 < \infty$ is obviously satisfied.

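Let $M$ and $\sigma^2$ denote, respectively, the mean and the variance of $U_l$. Clearly,

\[ M = -\sum_{i=1}^{\infty} \log \lambda_i\, \frac{\lambda_i^2}{\Lambda}, \qquad \sigma^2 = \sum_{i=1}^{\infty} |\log \lambda_i|^2\, \frac{\lambda_i^2}{\Lambda} - M^2. \]

Then the third central moment of $U_l$ is given by

\[ \alpha^3 \stackrel{\mathrm{def}}{=} \mathbb{E}(U_l - M)^3 = -\sum_{i=1}^{\infty} (\log \lambda_i)^3\, \frac{\lambda_i^2}{\Lambda} - 3M\sigma^2 - M^3. \]

If (1.7) is verified, we have $|M| < \infty$, $0 \le \sigma^2 < \infty$, and $|\alpha| < \infty$.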
In what follows the explosion coefficient

\[ \mathcal{E} \stackrel{\mathrm{def}}{=} \Lambda e^{2M} \tag{1.8} \]

will play a significant role, because its contribution to the "curse of dimensionality"
is the largest. It was shown in [44] that by concavity of the logarithmic function
$\mathcal{E} > 1$, except for the totally degenerate case when the number of strictly positive
eigenvalues is zero or one. In other words, $\mathcal{E} = 1$ if and only if $\sigma = 0$. Henceforth, we
will exclude this degenerate case.
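
The quantities $\Lambda$, $M$, $\sigma^2$ and the explosion coefficient are easy to compute for a finite list of marginal eigenvalues. The sketch below is an illustration under that truncation assumption; `marginal_sq` holds hypothetical values of $\lambda_i^2$. It also checks the identity $\log \mathcal{E} = -\sum_i p_i \log p_i$ with $p_i = \lambda_i^2 / \Lambda$, which follows directly from (1.6) and (1.8) and makes $\mathcal{E} \ge 1$ transparent.

```python
import math


def auxiliary_moments(marginal_sq):
    """Return (Lambda, M, sigma^2, explosion coefficient) for a finite list
    of strictly positive values lambda_i^2 (truncation assumed for the sketch)."""
    big_lambda = sum(marginal_sq)
    probs = [v / big_lambda for v in marginal_sq]
    values = [-0.5 * math.log(v) for v in marginal_sq]   # U = -log(lambda_i)
    mean = sum(p * u for p, u in zip(probs, values))
    variance = sum(p * (u - mean) ** 2 for p, u in zip(probs, values))
    explosion = big_lambda * math.exp(2.0 * mean)
    return big_lambda, mean, variance, explosion


def entropy(marginal_sq):
    """Shannon entropy of the distribution p_i = lambda_i^2 / Lambda."""
    big_lambda = sum(marginal_sq)
    return -sum((v / big_lambda) * math.log(v / big_lambda) for v in marginal_sq)


big_lambda, mean, variance, explosion = auxiliary_moments([1.0, 0.25])
single = auxiliary_moments([0.7])
```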

The following result was obtained in [44], Theorem 3.2.

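Theorem 3.1.1. Assume that the sequence $\{\lambda_i\}$, $i = 1, 2, \dots$, satisfies the condition

\[ \sum_{i=1}^{\infty} |\log \lambda_i|^2 \lambda_i^2 < \infty. \]

Then for every $\varepsilon \in (0, 1)$ we have

\[ \lim_{d \to \infty} \frac{\log n(\varepsilon, d) - d \log \mathcal{E}}{\sqrt{d}} = 2q, \]

where the quantile $q = q(\varepsilon)$ is chosen from the equation

\[ 1 - \Phi\left( \frac{q}{\sigma} \right) = \varepsilon^2 \tag{1.9} \]

with $\Phi(\cdot)$ denoting the standard normal distribution function.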
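
The quantile $q$ is straightforward to evaluate numerically. The sketch below assumes that equation (1.9) reads $1 - \Phi(q/\sigma) = \varepsilon^2$, so that $q = \sigma\, \Phi^{-1}(1 - \varepsilon^2)$; it uses `statistics.NormalDist` from the Python standard library, and the numerical inputs are illustrative.

```python
from statistics import NormalDist


def quantile_q(eps, sigma):
    """Solve 1 - Phi(q / sigma) = eps^2 for q (assumed form of (1.9))."""
    return sigma * NormalDist().inv_cdf(1.0 - eps ** 2)


q_half = quantile_q(0.5, 2.0)              # eps^2 = 1/4, positive quantile
q_median = quantile_q(0.5 ** 0.5, 1.0)     # eps^2 = 1/2, quantile near zero
q_large_eps = quantile_q(0.9, 1.0)         # eps^2 = 0.81, negative quantile
```

Note that $q$ changes sign at $\varepsilon^2 = 1/2$: for larger error levels the correction term $e^{2q\sqrt{d}}$ reduces, rather than increases, the leading exponential growth.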
The authors of [44] conjectured that under further assumptions on the sequence
$\{\lambda_i\}$ one can prove that

\[ n(\varepsilon, d) \asymp \frac{\mathcal{E}^d e^{2q\sqrt{d}}}{\sqrt{d}}, \qquad d \to \infty. \]

We will show that even a stronger statement holds.

3.2 Main result: the exact intractability rate in 
increasing dimension 

It turns out that two different cases depending on the nature of the distribution of 
Ui should be distinguished. The proof and the final result depend on whether this 
distribution is a lattice distribution or not. 

Recall that one calls a discrete distribution of a random variable $U$ a lattice
distribution if there exist numbers $a$ and $h > 0$ such that every possible value of
$U$ can be represented in the form $a + \nu h$, where $\nu$ is an integer. The number $h$ is
called the span of the distribution. In the following, when studying the lattice case,
we assume that $h$ is the maximal span of the distribution; i.e., one cannot represent
all possible values of $U_l$ in the form $b + \nu h_1$ for some $b$ and $h_1 > h$.

Definition (1.6) yields that the variables $U_l$ have a common lattice distribution if
and only if $\lambda_i = C e^{-n_i h}$ for some positive $C$, $h$ and $n_i \in \mathbb{N}$. We call this situation the
lattice case and will assume that $h$ is chosen as large as possible. Otherwise we say
that the nonlattice case holds.

By $f(d) = o(g(d))$ we mean that $\lim_{d \to \infty} f(d)/g(d) = 0$. In particular, $f(d) = g(d)(1 + o(1))$
means that $\lim_{d \to \infty} f(d)/g(d) = 1$.
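Theorem 3.2.1. Let the sequence $\{\lambda_i\}$, $i = 1, 2, \dots$, satisfy (1.7).
Then for every $\varepsilon \in (0, 1)$ it holds

\[ n(\varepsilon, d) = K\, \varphi\left( \frac{q}{\sigma} \right) \mathcal{E}^d e^{2q\sqrt{d}}\, d^{-1/2}\, (1 + o(1)), \qquad d \to \infty, \]

where

\[ K = \begin{cases} \dfrac{h}{\sigma (1 - e^{-2h})} & \text{in the lattice case,} \\[2ex] \dfrac{1}{2\sigma} & \text{otherwise,} \end{cases} \]

$\varphi$ is the standard normal density, and the quantile $q = q(\varepsilon)$ is defined in (1.9).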
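
For orientation, the leading term of the theorem can be evaluated on a logarithmic scale. The sketch below assumes the asymptotic expression has the form $K \varphi(q/\sigma)\, \mathcal{E}^d e^{2q\sqrt{d}} d^{-1/2}$ with $K = h / (\sigma(1 - e^{-2h}))$ in the lattice case and $K = 1/(2\sigma)$ otherwise, and that $q$ solves $1 - \Phi(q/\sigma) = \varepsilon^2$. The function and its parameter names (`explosion`, `span`) are hypothetical, and the $(1 + o(1))$ factor is ignored.

```python
import math
from statistics import NormalDist


def log_asymptotic_complexity(d, eps, sigma, explosion, span=None):
    """Logarithm of K * phi(q / sigma) * E^d * exp(2 q sqrt(d)) / sqrt(d).

    span=None selects the nonlattice constant K = 1 / (2 sigma); otherwise
    `span` is the maximal span h of the lattice distribution.
    """
    q = sigma * NormalDist().inv_cdf(1.0 - eps ** 2)
    if span is None:
        k_const = 1.0 / (2.0 * sigma)
    else:
        k_const = span / (sigma * (1.0 - math.exp(-2.0 * span)))
    log_phi = -0.5 * (q / sigma) ** 2 - 0.5 * math.log(2.0 * math.pi)
    return (math.log(k_const) + log_phi + d * math.log(explosion)
            + 2.0 * q * math.sqrt(d) - 0.5 * math.log(d))


nonlattice = log_asymptotic_complexity(10, 0.5, 1.0, 1.5)
tiny_span = log_asymptotic_complexity(10, 0.5, 1.0, 1.5, span=1e-6)
growth = (log_asymptotic_complexity(401, 0.5, 1.0, 1.5)
          - log_asymptotic_complexity(400, 0.5, 1.0, 1.5))
```

Working with logarithms avoids overflow for large $d$; the per-dimension increment `growth` approaches $\log \mathcal{E}$, in line with Theorem 3.1.1.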

Remark 3.2.1. One can see that the complexity of approximation increases exponentially
as $d \to \infty$. This phenomenon is referred to as the curse of dimensionality
or intractability; see, e.g., [59] and [77]. The notion of the "curse of dimensionality"
dates back at least to Bellman [5].

Remark 3.2.2. By l'Hôpital's rule,

\[ \lim_{h \to 0} \frac{h}{1 - e^{-2h}} = \frac{1}{2}, \]

and thus the relations for $K$ are in accordance as $h \to 0$.
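
The limit in the remark can be confirmed numerically. The sketch below evaluates $h / (1 - e^{-2h})$, the span-dependent factor that is assumed here to distinguish the lattice constant from the nonlattice one; `math.expm1` is used to avoid cancellation for small $h$, and the function name is illustrative.

```python
import math


def lattice_factor(h):
    """h / (1 - exp(-2h)), computed stably for small h via expm1."""
    return h / (-math.expm1(-2.0 * h))


near_zero = lattice_factor(1e-8)
at_one = lattice_factor(1.0)
```

For small spans the factor behaves like $\tfrac{1}{2}(1 + h)$, so the lattice constant exceeds the nonlattice one and converges to it from above.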



3.3 Proof of the main result 

This section presents a proof of Theorem 3.2.1.
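Proof. Let $\zeta = \zeta(\varepsilon, d)$ be the maximal positive number such that the sum of eigenvalues
satisfies

\[ \sum_{k \in \mathbb{N}^d :\, \lambda_k < \zeta} \lambda_k^2 \le \varepsilon^2 \Lambda^d. \]

Define a lattice set in $\mathbb{N}^d$ in the following way:

\[ A \stackrel{\mathrm{def}}{=} \{ k \in \mathbb{N}^d : \lambda_k \ge \zeta \} = \Big\{ k \in \mathbb{N}^d : \prod_{l=1}^{d} \lambda_{k_l} \ge \zeta \Big\}. \]

Since $\lambda_k > 0$ for any $k \in A$, one can write

\[
\begin{aligned}
n(\varepsilon, d) = \# A &= \sum_{k \in A} \frac{\lambda_k^2}{\lambda_k^2} \\
&= \sum_{k \in A} \Lambda^d \exp\Big\{ -2 \sum_{l=1}^{d} \log \lambda_{k_l} \Big\} \prod_{l=1}^{d} \mathbb{P}(U_l = -\log \lambda_{k_l}) \\
&= \Lambda^d\, \mathbb{E} \exp\Big\{ 2 \sum_{l=1}^{d} U_l \Big\}\, \mathbf{1}\Big\{ \sum_{l=1}^{d} U_l \le -\log \zeta \Big\}.
\end{aligned}
\]

For centered and normalized sums

\[ Z_d = \frac{\sum_{l=1}^{d} U_l - dM}{\sigma \sqrt{d}} \]

we have

\[ \Big\{ \sum_{l=1}^{d} U_l \le -\log \zeta \Big\} = \{ Z_d \le \theta \}, \]

where

\[ \theta = \theta(\varepsilon, d) = -\frac{\log \zeta + dM}{\sigma \sqrt{d}}. \tag{3.1} \]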
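We show now that $\theta$ has a useful probabilistic meaning in terms of $\{U_l\}$ and of
their sums. Applying Lemma 3.1 of [44] we have for any $d \in \mathbb{N}$ and $z \in \mathbb{R}_+$

\[ \sum_{k \in \mathbb{N}^d :\, \lambda_k < z} \lambda_k^2 = \Lambda^d\, \mathbb{P}\Big( \sum_{l=1}^{d} U_l > -\log z \Big) = \Lambda^d\, \mathbb{P}(Z_d > \theta_z), \]

where

\[ \theta_z = -\frac{\log z + dM}{\sigma \sqrt{d}}. \]

Fix $\varepsilon \in (0, 1)$. Observe that

\[ \sum_{k \in \mathbb{N}^d :\, \lambda_k < z} \lambda_k^2 \le \varepsilon^2 \Lambda^d \]

if and only if

\[ \mathbb{P}(Z_d > \theta_z) \le \varepsilon^2. \]

Therefore, $\theta = \theta(\varepsilon, d)$ defined by (3.1) is the $(1 - \varepsilon^2)$-quantile of the distribution of
$Z_d$, namely,

\[ \theta(\varepsilon, d) = \min\{\theta : \mathbb{P}(Z_d > \theta) \le \varepsilon^2\} = \min\{\theta : \mathbb{P}(Z_d \le \theta) \ge 1 - \varepsilon^2\}. \]

Let $q = q(\varepsilon)$ be the quantile of the normal distribution function chosen from (1.9).
Then in view of the central limit theorem

\[ \theta(\varepsilon, d) \to \frac{q}{\sigma}, \qquad d \to \infty, \tag{3.2} \]

for any fixed $\varepsilon \in (0, 1)$.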
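Now let us return to the information complexity. We obtain

\[
\begin{aligned}
n(\varepsilon, d) &= \mathcal{E}^d\, \mathbb{E} \exp\{2\sigma\sqrt{d}\, Z_d\}\, \mathbf{1}\{Z_d \le \theta\} \\
&= \mathcal{E}^d \exp\{2\sigma\sqrt{d}\, \theta\} \int_{-\infty}^{\theta} \exp\{2\sigma\sqrt{d}\,(z - \theta)\}\, dF_d(z),
\end{aligned}
\]

where $F_d(z) = \mathbb{P}(Z_d < z)$ and $\mathcal{E}$ is defined as in (1.8).
Denote

\[ \Psi_d(z) = \exp\{2\sigma\sqrt{d}\,(z - \theta)\} \]

and integrate by parts the integral

\[ \int_{-\infty}^{\theta} \Psi_d(z)\, d[F_d(z) - F_d(\theta)] = \int_{-\infty}^{\theta} [-F_d(z) + F_d(\theta)]\, d\Psi_d(z). \]

From now on we have to distinguish the lattice and nonlattice cases.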
From now on we have to distinguish the lattice and nonlattice cases. 
3.3.1 Nonlattice case 

In the following part of the proof we will assume that the distribution of $\{U_l\}$ is not
lattice. This is true in the most interesting cases, such as the Brownian sheet (the
Wiener-Chentsov random field), the completely tucked Brownian sheet (the Brownian
pillow), and the $d$-variate Hoeffding, Blum, Kiefer and Rosenblatt process (see
Appendix 3.4 for details).

In view of (1.7) we are able to apply the Cramér-Esseen theorem (cf. [21], section 42,
Theorem 2; [5l], Chap. V, section 5.7, Theorem 5.21; [53], Chap. VI,
section 3, Theorem 4). It leads to

[-F,{z)+Faie)]d^,{z) 

[-$(^) + $(e)]d^,(z) (3.3) 
where $(■) is the standard normal distribution function and 

J — oo 

6a^\'2nd 



(the last equivalence is provided by (3.2)).

Since $d\Psi_d(z) = 2\sigma\sqrt{d}\, \Psi_d(z)\, dz$ with $\Psi_d(z) = \exp\{2\sigma\sqrt{d}\,(z - \theta)\}$, the integral $I_2$ is given,
after a change of variable, by the following expression:

«^ r^n y ^2„.„ r 1^ y 

with y = —^/d{z — ^). 
For any d = 1,2, 



^ (,_^).exp{-i(.-X)^}exp{-2.,}d, 



°A''-;i)'^^''{4(''-^n^'i^i+^'^' 
This estimate gives us the majorant required in the Lebesgue dominated convergence
theorem. Using (3.2) and passing to the limit in the integral, we obtain, as $d \to \infty$.

Similarly, 

hid, 6) = f exp (1 + 0(1)) . 

Thus we obtain that $\sqrt{d}\, I_4 = \sqrt{d}\,(I_2 - I_3)(1 + o(1))$, and hence $I_2 - I_3 - I_4 = o(1/\sqrt{d})$.



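Consider the main integral $I_1$:

\[
\begin{aligned}
I_1 &= \int_{-\infty}^{\theta} [-\Phi(z) + \Phi(\theta)]\, d \exp\{2\sigma\sqrt{d}\,(z - \theta)\} \\
&= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\theta} \exp\{2\sigma\sqrt{d}\,(z - \theta)\} \exp\{-z^2/2\}\, dz \\
&= \frac{1}{\sqrt{2\pi d}} \int_{0}^{\infty} \exp\Big\{ -\frac{1}{2} \Big( \frac{y}{\sqrt{d}} - \theta \Big)^2 \Big\} \exp\{-2\sigma y\}\, dy \\
&= \frac{1}{2\sigma\sqrt{2\pi d}} \exp\Big\{ -\frac{q^2}{2\sigma^2} \Big\} (1 + o(1)), \qquad d \to \infty.
\end{aligned} \tag{3.4}
\]

Then

\[ n(\varepsilon, d) = \frac{\mathcal{E}^d \exp\{2q\sqrt{d}\}}{2\sigma\sqrt{d}}\, \varphi\Big( \frac{q}{\sigma} \Big) (1 + o(1)), \qquad d \to \infty, \]

where $\varphi$ is the standard normal density, as asserted.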


3.3.2 Lattice case 

Now we will proceed under the assumption that the random variables $U_l$ have a
lattice distribution. Let possible values of the random variable $U_l$ be $\bar{a} + \nu h$,
$\nu = 0, \pm 1, \pm 2, \dots$, where $\bar{a} = M + a$ is a shift and $h$ is the maximal span of the distribution.
Therefore, all possible values of $Z_d$ have the form

\[ \frac{da + \nu h}{\sigma \sqrt{d}}, \qquad \nu = 0, \pm 1, \pm 2, \dots. \]

Introduce the function

\[ S(x) = [x] - x + \frac{1}{2}, \]

where $[x]$ denotes, as usual, the integer part of $x$, and consider

\[ S_d(x) = \frac{h}{\sigma}\, S\Big( \frac{x \sigma \sqrt{d} - da}{h} \Big). \]

Let $F_d(z)$ be as above. Then under assumption (1.7) Esseen's result (see Theorem 1,
page 43 in pTj) yields

e-^'/2 fSdiz) a%z^-l)\ f 1 



uniformly in z. 

Comparing with (3.3), we observe that one needs only to evaluate the additional
term

J = -i= [-5,(z)e-^'/2 + S,ie)e-''/']d^,iz) 
y/zird J-oo 

= J' ^,iz)d (^,(z)e-^'/2j = _ J2 + J3, 



where 



J2 = r ^d{z)S,(z)ze-'"/'dz 

-00 



V2TTd 

and $J_3$ is a "discrete part", which is defined in the following way. Notice that $S(x)$ is
a periodic function with period one; therefore $S_d(x)$ possesses the period $h/(\sigma\sqrt{d})$ and
has jumps at the points $\{ \frac{kh + da}{\sigma\sqrt{d}},\ k \in \mathbb{Z} \}$. If the point $\theta$ belongs to this lattice, then there
exists an integer $k'$ such that $\theta = \frac{k'h + da}{\sigma\sqrt{d}}$. Hence, one can integrate the discontinuous
part of the integral $J$ with respect to the measure $\frac{h}{\sigma}\, \delta_{\frac{kh + da}{\sigma\sqrt{d}}}$ and obtain



k' 

1 h v-^ ^ fkh + da\ r lfkh + da\'^^ 



We start by estimating $J_1$. At the points where the derivative $S_d'(z)$ exists, one
can easily calculate that

\[ S_d'(z) = \frac{h}{\sigma}\, S'\Big( \frac{z \sigma \sqrt{d} - da}{h} \Big) \frac{\sigma \sqrt{d}}{h} = -\sqrt{d}. \]

Therefore, as in the nonlattice case, by the Lebesgue dominated convergence theorem 
we have 

Ji = [ exp{2(7V^(z-^)}exp{-zV2}d^ 

VStTO? ,/-oo 

^ ^dl -p{-^^(^-^)>xp{-2.,}d, 

r^^''P{"^}^^^''^^^^ , c?^oo, (3.5) 



2aV2Trd 

which yields $\sqrt{d}\, J_1 = -\sqrt{d}\, I_1 (1 + o(1))$.

As for the integral J2, this one, for sufficiently large d, becomes negligible. Indeed, 
J2 = -1= / exp{2(7V^(^ - ^)}S'd(^)^exp{-^V2}d^ 



^ 3h 



2adV27i 
3h 



Aa^dV^n 



— >■ 00. 
And, of course, $J_2 = o(1/\sqrt{d})$.

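Now we consider the essential summand

\[
\begin{aligned}
J_3 &= \frac{1}{\sqrt{2\pi d}}\, \frac{h}{\sigma} \sum_{k=-\infty}^{k'} \exp\Big\{ 2\sigma\sqrt{d}\, \Big( \frac{kh + da}{\sigma\sqrt{d}} - \theta \Big) \Big\} \exp\Big\{ -\frac{1}{2} \Big( \frac{kh + da}{\sigma\sqrt{d}} \Big)^2 \Big\} \\
&= \frac{1}{\sqrt{2\pi d}}\, \frac{h}{\sigma} \sum_{k=-\infty}^{k'} \exp\{2(k - k')h\} \exp\Big\{ -\frac{1}{2} \Big( \frac{kh + da}{\sigma\sqrt{d}} \Big)^2 \Big\} \\
&= \frac{1}{\sqrt{2\pi d}}\, \frac{h}{\sigma} \sum_{l=0}^{\infty} \exp\{-2lh\} \exp\Big\{ -\frac{1}{2} \Big( \theta - \frac{lh}{\sigma\sqrt{d}} \Big)^2 \Big\} \\
&= \frac{h}{\sigma\sqrt{d}\,(1 - e^{-2h}) \sqrt{2\pi}} \exp\Big\{ -\frac{q^2}{2\sigma^2} \Big\} (1 + o(1)), \qquad d \to \infty.
\end{aligned}
\]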



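We obtained

\[ \sqrt{d}\, J_3 = \sqrt{d}\, \frac{2h}{1 - e^{-2h}}\, I_1 (1 + o(1)). \tag{3.6} \]

Putting together (3.4), (3.5), and (3.6), we get

\[ n(\varepsilon, d) = \mathcal{E}^d e^{2q\sqrt{d}}\, \frac{h}{\sigma\sqrt{d}\,(1 - e^{-2h})}\, \frac{1}{\sqrt{2\pi}} \exp\Big\{ -\frac{q^2}{2\sigma^2} \Big\} (1 + o(1)), \qquad d \to \infty. \]

□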



3.4 Appendix. Examples of tensor product-type 
random fields 

This section contains some examples of random fields to which the above general 
result can be applied. 
3.4.1 Wiener-Chentsov random field

The Wiener-Chentsov field or the Brownian sheet (see [33]) is a zero-mean Gaussian
random function $W^{(d)}$ with covariance function equal to a product of the covariance
functions corresponding to the Wiener process $W$:

\[ \mathcal{K}_{W^{(d)}}(s, t) = \prod_{l=1}^{d} \min\{s_l, t_l\}, \qquad s = (s_1, \dots, s_d),\ t = (t_1, \dots, t_d) \in T. \]

Therefore the marginal eigenvalues have the following form:

\[ \lambda_{W;i}^2 = (\pi(i - 1/2))^{-2}, \qquad i = 1, 2, \dots. \]
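
As a quick sanity check, the marginal trace of the Wiener process can be computed from these eigenvalues. The sketch below assumes the marginal eigenvalues are $(\pi(i - 1/2))^{-2}$ and compares a partial sum with the value $\int_0^1 \min\{t, t\}\, dt = 1/2$ of the trace of the covariance kernel; the truncation level is an arbitrary choice for the example.

```python
import math


def wiener_marginal_eigenvalue(i):
    """Assumed marginal eigenvalue (pi * (i - 1/2))^(-2), i = 1, 2, ..."""
    return (math.pi * (i - 0.5)) ** -2


# Partial sum of the marginal trace; the neglected tail is of order 1 / (pi^2 * N).
partial_trace = sum(wiener_marginal_eigenvalue(i) for i in range(1, 100_001))
exact_trace = 0.5   # integral of min(t, t) = t over [0, 1]
```

The slow $1/N$ decay of the tail is typical for these spectra and should be kept in mind when the marginal sequence is truncated in numerical experiments.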

3.4.2 Completely tucked Brownian sheet 

The completely tucked Brownian sheet (the Brownian pillow) is a zero-mean Gaussian
random function $B^{(2)}$ with covariance function equal to a product of the covariance
functions corresponding to the standard Brownian bridge $B(t) = W(t) - tW(1)$,
namely

\[ \mathcal{K}_{B^{(2)}}(s, t) = \prod_{l=1}^{2} (\min\{s_l, t_l\} - s_l t_l), \qquad s, t \in [0, 1]^2. \]

Respectively, the marginal eigenvalues (see P]) are equal to

\[ \lambda_{B;i}^2 = (\pi i)^{-2}, \qquad i = 1, 2, \dots. \]

In the literature different terms are in use for this random field. In [73] the term 
"completely tucked Brownian sheet" is used; in [12] "tied-down Kiefer process" is 
used; in [51] this field is called "the Brownian pillow" . 

The notion of "completely tucked Brownian sheet" and its generalization for the
case $d > 2$ were introduced by Blum, Kiefer, and Rosenblatt [7] as the limit distribution
for a functional of an empirical process occurring in nonparametric testing of
independence, the so-called "independence empirical process" (see [73]). Therefore,
the $d$-parametric generalization of the completely tucked Brownian sheet is often referred
to as the "$d$-variate Hoeffding, Blum, Kiefer, and Rosenblatt process" (see, for
example, [21]). The mention of Hoeffding's name in the term is motivated by the
fact that the test studied in [7] is equivalent to the one suggested earlier by Hoeffding
in [23]. However, the limiting distribution, the covariance function, the eigenvalues
and the eigenfunctions of the respective integral equation were obtained in [7].
Higher-dimensional generalizations were later treated in [15] and [13].

3.4.3 Centered Gaussian processes 

In some statistical problems it is convenient to use centered empirical processes and 
corresponding limiting Gaussian processes. 

For any Gaussian process $X = \{X(t)\}$, $t \in [0, 1]$, we define the centered process

\[ \bar{X}(t) = X(t) - \int_0^1 X(u)\, du. \]

The centered Brownian bridge $\bar{B}$, also referred to in the literature as the Watson
process, was introduced in [7^ for nonparametric goodness-of-fit testing on a circle.
Watson showed that the covariance function is given by

\[ \mathcal{K}_{\bar{B}}(s, t) = \min\{s, t\} - st + \frac{1}{2}(s^2 + t^2 - s - t) + \frac{1}{12}, \qquad s, t \in [0, 1], \]

and the covariance operator with this kernel has a double spectrum, i.e.,

\[ \lambda_{\bar{B};2i-1}^2 = \lambda_{\bar{B};2i}^2 = (2\pi i)^{-2}, \qquad i = 1, 2, \dots. \]


The covariance function of the centered Wiener process $\bar{W}$ has the form

\[ \mathcal{K}_{\bar{W}}(s, t) = \min\{s, t\} + \frac{1}{2}(s^2 + t^2) - s - t + \frac{1}{3}, \qquad s, t \in [0, 1], \]

and the corresponding eigenvalues coincide with those of the standard Brownian
bridge, i.e.,

\[ \lambda_{\bar{W};i}^2 = \lambda_{B;i}^2 = (\pi i)^{-2}, \qquad i = 1, 2, \dots. \]

This is in accordance with the well-known equality in distribution for the $L_2$-norms
of the Brownian bridge and the centered Wiener process; see [1].
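
A simple consistency check for the centered processes is the trace identity: the integral of the covariance kernel over the diagonal must equal the sum of the eigenvalues. The sketch below assumes the covariance functions $\min\{s,t\} - st + \tfrac12(s^2 + t^2 - s - t) + \tfrac1{12}$ (Watson process, double eigenvalues $(2\pi i)^{-2}$) and $\min\{s,t\} + \tfrac12(s^2 + t^2) - s - t + \tfrac13$ (centered Wiener process, eigenvalues $(\pi i)^{-2}$); the quadrature rule and truncation levels are arbitrary choices for the illustration.

```python
import math


def watson_cov(s, t):
    """Assumed covariance of the centered Brownian bridge (Watson process)."""
    return min(s, t) - s * t + 0.5 * (s * s + t * t - s - t) + 1.0 / 12.0


def centered_wiener_cov(s, t):
    """Assumed covariance of the centered Wiener process."""
    return min(s, t) + 0.5 * (s * s + t * t) - s - t + 1.0 / 3.0


def trace_by_midpoint(cov, steps=10_000):
    """Midpoint-rule approximation of the integral of cov(t, t) over [0, 1]."""
    return sum(cov((i + 0.5) / steps, (i + 0.5) / steps) for i in range(steps)) / steps


def partial_sum(term, count=200_000):
    return sum(term(i) for i in range(1, count + 1))


watson_trace = trace_by_midpoint(watson_cov)
watson_eigen_sum = 2.0 * partial_sum(lambda i: (2.0 * math.pi * i) ** -2)
wiener_trace = trace_by_midpoint(centered_wiener_cov)
wiener_eigen_sum = partial_sum(lambda i: (math.pi * i) ** -2)
```

Both traces come out as $1/12$ and $1/6$ respectively, matching the eigenvalue sums up to the truncation error.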
The centered integrated Brownian bridge 



B{t) = B{t) - [ B{u)du, 
Jo 



where 

B{t) = [ B{u)du, t G [0, 1] 
Jo 

was considered in a framework of goodness-of-fit testing and small deviation proba- 
bilities under the L2-norm in [22] and [1], where its covariance function 

^ , , stmm{s,t} minis, t}3 {stf + + + 1 

K-ft s, t = 1 1 , 

' ' 2 6 4 6 24 6 45 

s, t E [0, 1], and eigenvalues 

A|^^ = (7^^)-^ z = l,2,..., 

were obtained. 

3.4.4 Multivariate Anderson-Darling processes 

The tensor product of Anderson-Darling processes $A^{(d)}(t)$, $t \in [0, 1]^d$, is a zero-mean
Gaussian random function with covariance function

\[ \mathcal{K}_{A^{(d)}}(s, t) = \prod_{l=1}^{d} \frac{\min\{s_l, t_l\} - s_l t_l}{\sqrt{s_l(1 - s_l)}\, \sqrt{t_l(1 - t_l)}}, \qquad s, t \in [0, 1]^d. \]

The marginal eigenvalues of the corresponding covariance operator are given by

\[ \lambda_{A;i}^2 = \frac{1}{i(i + 1)}, \qquad i = 1, 2, \dots. \]

In the one-dimensional case the Anderson-Darling process coincides in distribution
with $B(t)/\sqrt{t(1 - t)}$, $t \in [0, 1]$, and was introduced in [3] in the context of goodness-of-fit
testing. Anderson and Darling obtained its covariance function and the exact
spectrum.

In [57] another multivariate extension of the Anderson-Darling process, defined as
a zero-mean Gaussian process with the covariance function

.^// / N f minis, tj — st \ ^, 
'^Ais,t)= ^ =_L^^^_^ , s,tG[0,l],/x>0, 
\^/s{l - s)^t[l-t)J 

is given. 

The eigenvalues of its covariance operator are of the form 

\2 ^ ^^ 7 = 12 

(/. + j-l)(/. + j)'^ ' 

When the parameter $\mu$ is a positive integer, the random field defined in such a way
(more precisely, the square of its $L_2$-norm) is the limit distribution for Cramér-von
Mises-type statistics.






Bibliography 



[1] Adler, R. J. (1990). An Introduction to Continuity, Extrema and Related Top- 
ics for General Gaussian Processes. Institute of Mathematical Statistics Lec- 
ture Notes - Monograph Series, 12. Institute of Mathematical Statistics, Hay- 
ward, CA. 

[2] Akaike, H. (1973). Information theory and an extension of the maximum 
likelihood principle. Second International Symposium on Information Theory 
(Tsahkadsor, 1971), Akademiai Kiado, Budapest 267-281. 

[3] Anderson, T. W. and Darling, D. A. (1952). Asymptotic theory of certain "goodness
of fit" criteria based on stochastic processes. Ann. Math. Statistics 23 193-212.

[4] Beghin, L., Nikitin, Ya. and Orsingher, E. (2003). Exact small ball constants for
some Gaussian processes under the $L^2$-norm. Zap. Nauchn. Sem. S.-Peterburg.
Otdel. Mat. Inst. Steklov. (POMI) 298 (2003), Veroyatn. i Stat. 6 5-21, 316;
translation in J. Math. Sci. (N. Y.) 128:1 (2005) 2493-2502.

[5] Bellman, R. (1961). Adaptive Control Processes: a guided tour. Princeton Uni- 
versity, Princeton. 




[6] Belomestny, D. and Spokoiny, V. (2007) Spatial aggregation of local likelihood 
estimates with applications to classification. Ann. Statist. 35 2287-2311. 

[7] Blum, J. R., Kiefer, J. and Rosenblatt, M. (1961). Distribution free tests of 
independence based on the sample distribution function. Ann. Math. Statist. 
32:2 485-498. 

[8] Brown, L. D. and Low, M. G. (1992). Superefficiency and lack of adaptability
in functional estimation. Technical report, Cornell Univ.

[9] Buslaev, A. P. and Seleznjev, O. V. (1999). On certain extremal problems in the 
theory of approximation of random processes. East J. Approx. 5:4 467-481. 

[10] Butucea, C. (2001). Exact adaptive pointwise estimation on Sobolev classes of 
densities. ESAIM Probab. Statist. 5 1-31. 

[11] Čížek, P., Härdle, W. and Spokoiny, V. (2009). Adaptive pointwise estimation in
time-inhomogeneous conditional heteroscedasticity models. Econometrics Journal
12:2 248-271.

[12] Csörgő, M. and Horváth, L. (1997). Limit Theorems in Change-point Analysis.
Wiley, New York.

[13] Deheuvels, P. (1981). An asymptotic decomposition for multivariate distribution- 
free tests of independence. J. Multivariate Anal. 11:1 102-113. 

[14] Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet
shrinkage. Biometrika 81 425-455.






[15] Dugué, D. (1975). Sur des tests d'indépendance "indépendants de la loi". C. R.
Acad. Sci. Paris Sér. A-B 281:24 Aii, A1103-A1104.

[16] Fan, J. and Gijbels, I. (1995). Data-driven bandwidth selection in local poly- 
nomial fitting: variable bandwidth and spatial adaptation. J. Roy. Statist. Soc. 
Ser. B 57:2 371-394. 

[17] Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications.
Monographs on Statistics and Applied Probability, 66. Chapman and Hall, London.

[18] Fan, J., Farmen, M. and Gijbels, I. (1998). Local maximum likelihood estimation 
and inference. J. R. Stat. Soc. Ser. B Stat. Methodol. 60:3 591-608. 

[19] Fan, J., Zhang, C. and Zhang, J. (2001). Generalized likelihood ratio statistics 
and Wilks phenomenon. Ann. Statist. 29:1 153-193. 

[20] Foi, A. (2005) Anisotropic nonparametric image processing: Theory, algorithms 
and applications. Ph.D. Thesis, Dip. di Matematica, Politecnico di Milano, Milan, 
Italy, www.cs.tut.fi/ lasip. 

[21] Gnedenko, B. V. and Kolmogorov, A. N. (1954). Limit Distributions for Sums
of Independent Random Variables. Translated and annotated by K. L. Chung.
With an Appendix by J. L. Doob. Addison-Wesley Publishing Company, Inc.,
Cambridge, Mass. (in Russian: GTTI, Moscow-Leningrad, 1949).






[22] Goldenshluger, A. and Nemirovski, A. (1994). On spatial adaptive estimation
of nonparametric regression. Research report, Technion-Israel Inst. Technology,
Haifa, Israel.

[23] Henze, N. and Nikitin, Ya. Yu. (2000). A new approach to goodness-of-fit testing 
based on the integrated empirical process. J. Nonparametr. Statist. 12:3 391-416. 

[24] Hoeffding, W. (1948). A non-parametric test of independence. Ann. Math. Statistics
19 546-557.

[25] Huber, P. J. (1964). Robust estimation of a location parameter. Ann. Math. 
Statist. 35 73-101. 

[26] Ibragimov, I. A. and Has'minskii, R. Z. (1981). Statistical Estimation. Asymptotic
Theory. Applications of Mathematics, 16. Springer-Verlag, New York-Berlin.

[27] Karhunen, K. (1946). Zur Spektraltheorie stochastischer Prozesse. Ann. Acad. 
Sci. Fennicae, Ser. A. I. Math.-Phys. 34 1-7. 

[28] Karhunen, K. (1947). Über lineare Methoden in der Wahrscheinlichkeitsrechnung.
Ann. Acad. Sci. Fennicae, Ser. A. I. Math.-Phys. 37 3-79.

[29] Katkovnik, V. Ja. (1979). Linear and nonlinear methods of nonparametric re- 
gression analysis. (Russian) Soviet Automat. Control 5 35-46, 93. 

[30] Katkovnik, V. Ja. (1983). Convergence of linear and nonlinear nonparametric
estimates of "kernel" type. Automat. Remote Control 44:4 495-506; translated
from Avtomat. i Telemekh. (1983) 4 108-120 (Russian).






[31] Katkovnik, V. Ja. (1985). Nonparametric Identification and Data Smoothing: 
Local Approximation Approach. Nauka, Moscow (Russian). 

[32] Katkovnik, V., Egiazarian, K. and Astola, J. (2006). Local Approximation Tech- 
niques in Signal and Image Processing. Bellingham, WA: SPIE Press. 

[33] Katkovnik, V. and Spokoiny, V. (2008). Spatially adaptive estimation via fitted 
local likelihood techniques. IEEE Trans. Signal Process., 56:3 873-886. 

[34] Koning, A. J. and Protasov, V. (2003). Tail behaviour of Gaussian processes 
with applications to the Brownian pillow. J. Multivariate Anal. 87:2 370-397. 

[35] Kosambi, D. D. (1943). Statistics in functional space. J. Indian Math. Soc. 
(N. S.) 7 76-88. 

[36] Kühn, Th. and Linde, W. (2002). Optimal series representation of fractional 
Brownian sheets. Bernoulli 8:5 669-696. 

[37] Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. Ann. 
Math. Statistics 22 79-86. 

[38] Lepskii, O. V. (1990). A problem of adaptive estimation in Gaussian white noise. 
(Russian) Teor. Veroyatnost. i Primenen. 35:3 459-470; translation in Theory 
Probab. Appl. 35:3 454-466. 

[39] Lepskii, O. V. (1991). Asymptotic minimax adaptive estimation. I. Upper 
bounds. Optimally adaptive estimates. (Russian). Teor. Veroyatnost. i Prime- 
nen. 36:4 645-659; translation in Theory Probab. Appl. 36:4 682-697. 



[40] Lepskii, O. V. (1992). Asymptotic minimax adaptive estimation. II. Schemes 
without optimal adaptation. Adaptive estimates. (Russian) Teor. Veroyatnost. i 
Primenen. 37:3 468-481; translation in Theory Probab. Appl. 37:3 433-448. 

[41] Lepski, O. V., Mammen, E. and Spokoiny, V. G. (1997). Optimal spatial adapta- 
tion to inhomogeneous smoothness: an approach based on kernel estimates with 
variable bandwidth selectors. Ann. Statist. 25:3 929-947. 

[42] Lepski, O. V. and Spokoiny, V. G. (1997). Optimal pointwise adaptive methods 
in nonparametric estimation. Ann. Statist. 25:6 2512-2546. 

[43] Lifshits, M. A. (1995). Gaussian Random Functions. Mathematics and its Ap- 
plications, 322. Kluwer Academic Publishers, Dordrecht. 

[44] Lifshits, M. A. and Tulyakova, E. V. (2006). Curse of dimensionality in approx- 
imation of random fields. Probab. Math. Statist. 26:1 97-112. 

[45] Loader, C. (1999). Local Regression and Likelihood. Statistics and Computing. 
Springer-Verlag, New York. 

[46] Loève, M. (1946). Fonctions aléatoires de second ordre. Revue Sci. 84 195-206. 

[47] Mathé, P. (2006). The Lepskii principle revisited. Inverse Problems 22:3 L11-L15. 

[48] Mercurio, D. and Spokoiny, V. (2004). Statistical inference for time- 
inhomogeneous volatility models. Ann. Statist. 32:2 577-602. 



[49] Novak, E. and Woźniakowski, H. (2008). Tractability of Multivariate Problems. 
Vol. 1: Linear Information. EMS Tracts in Mathematics, 6. European Mathematical 
Society (EMS), Zürich. 

[50] Obukhov, A. M. (1954). Statistical description of continuous fields. (Russian) 
Tr. geophis. Inst. Akad. Nauk SSSR 24(151) 3-42. 

[51] Traub, J. F., Wasilkowski, G. W. and Woźniakowski, H. (1988). Information-based 
Complexity. With contributions by A. G. Werschulz and T. Boult. Computer 
Science and Scientific Computing. Academic Press, Inc., Boston, MA. 

[52] Traub, J. F. and Werschulz, A. G. (1998). Complexity and Information. Lezioni 
Lincee. [Lincei Lectures] Cambridge University Press, Cambridge. 

[53] Petrov, V. V. (1975). Sums of Independent Random Variables. Translated from 
the Russian by A. A. Brown. Ergebnisse der Mathematik und ihrer Grenzgebiete, 
Band 82. Springer-Verlag, New York-Heidelberg. 

[54] Petrov, Valentin V. (1995). Limit Theorems of Probability Theory. Sequences of 
Independent Random Variables. Oxford Studies in Probability, 4. Oxford Sci- 
ence Publications. The Clarendon Press, Oxford University Press, New York. 
(in Russian: Nauka, Moscow, 1987). 

[55] Polzehl, J. and Spokoiny, V. (2006). Propagation-separation approach for local 
likelihood estimation. Probab. Theory Relat. Fields 135 335-362. 

[56] Pougachev, V. S. (1953). General theory of the correlations of random functions. 
Izv. Akad. Nauk SSSR, Ser. Math. 17:5 401-420. 



[57] Pycke, J.-R. (2003). Multivariate extensions of the Anderson-Darling process. 
Stat. Probab. Lett. 63:4 387-399. 

[58] Reiss, M., Rozenholc, Y., Cuenod, C.-A. (2009). Pointwise adaptive estimation 
for robust and quantile regression. arXiv:0904.0543v1. 

[59] Ritter, K. (2000). Average-case Analysis of Numerical Problems. Lecture Notes 
in Mathematics 1733. Springer-Verlag, Berlin. 

[60] Sabelfeld, K. (2007). Expansion of random boundary excitations for elliptic 
PDEs. Monte Carlo Methods Appl. 13:5-6 405-453. 

[61] Serdyukova, N. A. (2009). Dependence on the dimension for complexity of ap- 
proximation of random fields. (Russian) Teor. Veroyatnost. i Primenen. 54:2 
256-270; translation in Theory Probab. Appl. (2010) 54:2 272-284. 

[62] Serdyukova, N. A. (2009). Local parametric estimation under noise misspecifica- 
tion in regression. arXiv:0912.4489. 

[63] Spokoiny, V. G. (1998). Estimation of a function with discontinuities via local 
polynomial fit with an adaptive window choice. Ann. Statist. 26:4 1356-1378. 

[64] Spokoiny, V. (2002). Variance estimation for high-dimensional regression models. 
J. Multivariate Anal. 82 111-133. 

[65] Spokoiny, V. (2009). Multiscale local change point detection with applications 
to Value-at-Risk. Ann. Statist. 37 1405-1436. 

[66] Spokoiny, V. and Vial, C. (2009). Parameter tuning in pointwise adaptation 
using a propagation approach. Ann. Statist. 37:5B 2783-2807. 

[67] Traub, J. F. and Werschulz, A. G. (1998). Complexity and Information. Lezioni 
Lincee. [Lincei Lectures] Cambridge University Press, Cambridge. 

[68] Triebel, H. (1992). Theory of function spaces. II. Monographs in Mathematics, 
84. Birkhäuser Verlag, Basel. 

[69] Tsybakov, A. B. (1982). Nonparametric signal estimation when there is incomplete 
information on the noise distribution. Problems of Information Transmission 
18:2 116-130. 

[70] Tsybakov, A. B. (1982). Robust estimates of a function. Problems of Information 
Transmission 18:3 190-201. 

[71] Tsybakov, A. B. (1986). Robust reconstruction of functions by the local- 
approximation method. Problems of Information Transmission 22:2 133-146. 

[72] Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer 
Series in Statistics. Springer-Verlag, New York, or Introduction à l'estimation 
non-paramétrique. (French) [Introduction to nonparametric estimation] 
Mathématiques & Applications (Berlin) [Mathematics & Applications], 41. 
Springer-Verlag, Berlin, 2004. 

[73] van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical 
Processes. With Applications to Statistics. Springer Series in Statistics. Springer- 
Verlag, New York. 

[74] Watson, G. S. (1961). Goodness-of-fit tests on a circle. Biometrika 48 109-114. 



[75] White, H. (1982). Maximum likelihood estimation of misspecified models. Econo- 
metrica 50:1 1-25. 

[76] Woźniakowski, H. (1992). Average case complexity of linear multivariate problems. 
Part 1: Theory. Part 2: Applications. J. Complexity 8:4 337-372 and 
373-392. 

[77] Woźniakowski, H. (1994). Tractability and strong tractability of linear multivariate 
problems. J. Complexity 10:1 96-128. 

[78] Woźniakowski, H. (1994). Tractability and strong tractability of multivariate 
tensor product problems. J. of Computing and Information 4 1-19. 

[79] Woźniakowski, H. (2006). Tractability of multivariate problems for weighted 
spaces of functions. Approximation and Probability, Banach Center Publ., 72 
407-427. 



Index of notation 

$\lfloor x \rfloor$                  greatest integer strictly less than the real number $x$
$[x]$                                integer part of $x$
$\log$                               natural logarithm
$\stackrel{\mathrm{def}}{=}$         equals by definition
w.r.t.                               with respect to
$\Sigma(\beta, L)$                   Hölder class of functions

Sets

$\emptyset$                          the empty set
$\#\{\cdot\}$                        cardinality of the set $\{\cdot\}$
$A \cap B$                           intersection, $\{x : x \in A \text{ and } x \in B\}$

Special functions

$\Gamma(\cdot)$                      the $\Gamma$-function
$\Phi(\cdot)$                        the standard normal distribution function

Landau notation

$f(x) = o(g(x))$, $x \to x_0$        means that $\lim_{x \to x_0} f(x)/g(x) = 0$
$f(x) = O(g(x))$, $x \to x_0$        means that $|f(x)| \le C|g(x)|$, as $x \to x_0$

Linear algebra

$\gamma^\top$, $A^\top$              transpose of the vector $\gamma$ or of the matrix $A$
$\lambda_j(A)$                       $j$th eigenvalue of $A$
$\lambda_1(A)$, $\lambda_{\max}(A)$  largest eigenvalue of the symmetric matrix $A$
$\operatorname{tr}(A)$               trace of $A$, the sum of the diagonal elements of the square matrix $A$
$\operatorname{rank}(A)$             rank of $A$
$\dim U$                             dimension of the vector space $U$
$\mathcal{C}(A)$                     column space of $A$, the space spanned by the columns of $A$
$\|\gamma\|$                         L2 vector norm, Euclidean norm
$\|A\|$                              induced matrix norm based on the L2 vector norm (p. 31)
$A \preceq B$                        $B - A \succeq 0$, Löwner partial ordering (p. 32)
$A \succ 0$                          $A$ is positive definite, $\gamma^\top A \gamma > 0$ for $\gamma \ne 0$
$A \succeq 0$                        $A$ is nonnegative definite, $\gamma^\top A \gamma \ge 0$
$A^{-1}$                             inverse of $A$ when $A$ is nonsingular
$\det A$                             determinant of a square matrix $A$
$A \otimes B$                        Kronecker product of $A$ and $B$ (p. 40)
$\operatorname{diag}(x_1, \ldots, x_n)$   $n \times n$ matrix with diagonal elements $x_1, \ldots, x_n$
                                     and zeros elsewhere
$\operatorname{vec} A$               if $A$ is an $m \times n$ matrix, then $\operatorname{vec} A$ is the $mn \times 1$ vector
                                     formed by writing the columns of $A$ one below the other
$\kappa(A)$, $\kappa_2(A)$           condition number of the positive definite matrix $A$

Probability and statistics

$\delta_x$                           Dirac measure on $x$
$\stackrel{d}{=}$                    equality in distribution
a.s.                                 almost surely
$I\{\cdot\}$                         indicator of the set $\{\cdot\}$
$\mathcal{N}(0, 1)$                  the standard normal distribution
$\varphi(\cdot)$                     density of the distribution $\mathcal{N}(0, 1)$
$\mathcal{N}(0, I_n)$                standard normal distribution in $\mathbb{R}^n$
$\mathcal{N}(\theta, \Sigma)$        normal distribution with mean $\theta$ and covariance matrix $\Sigma$
$\tilde\theta = \operatorname{argmax}_{\theta} L(\theta)$   means that $L(\tilde\theta) = \max_{\theta} L(\theta)$
MSE                                  mean squared risk at a point
$\mathrm{KL}(P, P_\theta)$           Kullback-Leibler divergence between the measures $P$ and $P_\theta$ (p. 23)

Assumptions 



($\mathcal{L}p1$) - ($\mathcal{L}p4$) p. 10 

(5) ) p. 22 
($\mathcal{L}oc$) p. 23 
(W) p. 24 

Propagation conditions (PC) p. 28 

(6) p. 29 
(6i) p. 49 
(SMB) p. 42 
(SMBj) p. 54 

- ($\mathcal{L}p4'$) p. 59-60 

{S.pi.'^) p. 49 